Electronic Design

  
[Leapfrog: First Look]
SIMT Architecture Delivers Double-Precision Teraflops

William Wong  |   ED Online ID #19280  |   July 10, 2008


NVidia’s T10 architecture brings double-precision floating point to the company’s massively parallel computing platform. This graphics processing unit (GPU) architecture also is used in NVidia’s consumer graphics boards. Both are supported by the Compute Unified Device Architecture (CUDA). The Tesla S1070 1U rack-mount system incorporates four of the Tesla T10 boards, each with a single chip containing 240 cores (Fig. 1). The Tesla C1060 resembles these boards, but it plugs into a wide PCI Express x16 slot.

The T10 brings a number of new features to the Tesla line (Fig. 2). It doubles performance, moving to double-precision floating point and packing 4 Gbytes onto the board. It also uses one of the largest production chips around, with 1.4 billion transistors in 240 cores that can churn out 1 teraflop.

The architecture is the same as the earlier single-precision Tesla 8/G80. It is based around a single-instruction, multiple-thread (SIMT) execution model where groups of up to eight threads will execute the same instruction in a thread processing array (TPA). Cores in a TPA share fast access to 16 kbytes of TPA memory. Three TPAs are grouped into thread processing clusters (TPC), and thread contexts are collected into groups of 32 threads.
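The figures quoted above can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope model: it assumes every TPA holds exactly eight cores (inferred from the "eight-core TPA" wording and the 240-core total), and the variable names are made up for illustration.

```python
import math

TOTAL_CORES = 240        # cores per T10 chip (from the article)
CORES_PER_TPA = 8        # assumption: each TPA holds eight cores
TPAS_PER_TPC = 3         # three TPAs per cluster (from the article)
THREADS_PER_GROUP = 32   # thread contexts are collected in groups of 32
TPA_MEMORY_KBYTES = 16   # shared TPA memory (from the article)

# Number of TPAs and TPCs implied by the core count.
tpas = TOTAL_CORES // CORES_PER_TPA
tpcs = tpas // TPAS_PER_TPC

# Passes an eight-core TPA needs to push one instruction through
# a full 32-thread group.
passes_per_group = math.ceil(THREADS_PER_GROUP / CORES_PER_TPA)

# Total TPA memory across the chip.
total_tpa_memory_kbytes = tpas * TPA_MEMORY_KBYTES

print(f"{tpas} TPAs in {tpcs} TPCs")
print(f"{passes_per_group} passes per 32-thread group")
print(f"{total_tpa_memory_kbytes} kbytes of TPA memory in total")
```

Under these assumptions, a single instruction for one 32-thread group occupies a TPA for several passes, which is why the dispatch unit interleaves different groups on the same TPA.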

The thread dispatch unit matches active threads that will execute the same instruction to use as many cores as possible at one time. The maximum throughput is attained when all cores are active each cycle. Different groups of threads can execute their matching instruction on the same TPA in an alternating, sequential fashion. The hardware handles thread scheduling and dispatch. All of the threads of the same priority are either active or waiting for activation.

Execution is very efficient if threads remain in lock-step. This is very common when dealing with arrays. If a group of 32 threads running together hits a branch point and half take the branch, the eight-core TPA can still work nicely with the resulting sets of threads. Obviously, algorithms or data that require many individual threads running different code will not fare as well with this architecture.
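To make the cost of a branch concrete, here is a toy model of a lock-step thread group. It assumes the hardware runs the two sides of a branch one after the other, with the threads that did not take a side sitting idle; the article does not spell out the exact mechanism, and the function `run_group` is hypothetical.

```python
def run_group(values, predicate, then_op, else_op):
    """Run one if/else over a lock-step group of threads.

    Returns (results, paths_issued), where paths_issued counts how many
    sides of the branch had to be executed for the group.
    """
    mask = [predicate(v) for v in values]
    results = list(values)
    paths_issued = 0

    if any(mask):                       # at least one thread takes the branch
        paths_issued += 1
        for lane, active in enumerate(mask):
            if active:
                results[lane] = then_op(values[lane])

    if not all(mask):                   # at least one thread falls through
        paths_issued += 1
        for lane, active in enumerate(mask):
            if not active:
                results[lane] = else_op(values[lane])

    return results, paths_issued


def is_even(v):
    return v % 2 == 0


def halve(v):
    return v // 2


def triple_plus_one(v):
    return 3 * v + 1


uniform = list(range(0, 64, 2))   # 32 threads, all with even inputs
mixed = list(range(32))           # 32 threads, half even and half odd

_, uniform_paths = run_group(uniform, is_even, halve, triple_plus_one)
mixed_results, mixed_paths = run_group(mixed, is_even, halve, triple_plus_one)

print("uniform group paths:", uniform_paths)
print("mixed group paths:", mixed_paths)
```

In this model, a group whose threads all agree issues one path, while a split group issues both. Code where every thread goes its own way multiplies that cost, which matches the caution in the paragraph above.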

Code can incorporate synchronization points where threads will come together on the same instruction. Synchronization can be important if early termination tests can be performed. Otherwise, the chip tends to run a group of threads as long as possible.
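A synchronization point behaves like a barrier: no thread moves past it until every thread in the group has arrived. The sketch below imitates that with ordinary Python threads and `threading.Barrier`. It is an analogy only, not CUDA code (CUDA C provides its own barrier, `__syncthreads()`), and the list named `shared` merely stands in for TPA memory.

```python
import threading

NUM_THREADS = 8
barrier = threading.Barrier(NUM_THREADS)
shared = [0] * NUM_THREADS          # stand-in for shared TPA memory
neighbor_sums = [0] * NUM_THREADS


def worker(tid):
    # Phase 1: every thread writes its own slot.
    shared[tid] = tid * tid

    # Synchronization point: wait until all slots have been written.
    barrier.wait()

    # Phase 2: it is now safe to read a neighbor's slot.
    right = (tid + 1) % NUM_THREADS
    neighbor_sums[tid] = shared[tid] + shared[right]


threads = [threading.Thread(target=worker, args=(t,)) for t in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(neighbor_sums)
```

Without the barrier, a thread could read its neighbor's slot before the neighbor had written it.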

The latest implementation can have a computation and a one-way data transfer occurring at the same time. The older G80 architecture could perform only one action at a time. The off-chip interface is PCI Express (PCIe) Gen2 with 16 lanes, which tops out at roughly 8 Gbytes/s in each direction. The 102-Gbyte/s figure quoted for the board is its on-board memory bandwidth, not the PCIe link rate.
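The benefit of overlapping a transfer with a computation can be estimated with a simple pipeline model. This sketch assumes the work splits into equal chunks, that one transfer and one computation can run concurrently, and that enough buffers exist to hold chunks that arrive early. The per-chunk timings are invented for illustration, not measured on a T10.

```python
def serial_time(chunks, transfer, compute):
    """Total time when each chunk is transferred, then computed, in turn."""
    return chunks * (transfer + compute)


def overlapped_time(chunks, transfer, compute):
    """Total time when the transfer of the next chunk overlaps the current compute.

    The first transfer and the last compute cannot be hidden; in between,
    the slower of the two stages sets the pace.
    """
    if chunks == 0:
        return 0
    return transfer + (chunks - 1) * max(transfer, compute) + compute


CHUNKS = 8
TRANSFER_MS = 2   # hypothetical time to move one chunk onto the board
COMPUTE_MS = 3    # hypothetical time to process one chunk

t_serial = serial_time(CHUNKS, TRANSFER_MS, COMPUTE_MS)
t_overlap = overlapped_time(CHUNKS, TRANSFER_MS, COMPUTE_MS)

print("one action at a time:", t_serial, "ms")
print("transfer overlapped with compute:", t_overlap, "ms")
```

The closer the transfer and compute times are to each other, the more of the transfer cost the overlap hides.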

Data can be moved more quickly with the faster PCIe links, but most users are more impressed by the 4 Gbytes of on-board storage since most data needs to be in that memory to be used efficiently. Multiboard solutions work well if the data can be spread across the boards with minimal cross-board communication being required.

CUDA hides most of the underlying hardware complexity from the programmer. It depends upon a few C annotations so the CUDA C compiler can better address the multithreading aspects of an application. The system does not handle recursion, and loops and arrays are the norm. This isn’t surprising given the original target for GPUs.
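The "loops and arrays" style can be illustrated by emulating a kernel launch in plain Python. This is not CUDA and does not use any NVidia API; the names `saxpy_kernel` and `launch` are hypothetical. In CUDA C, the kernel would carry a `__global__` annotation and read its position from the built-in `blockIdx`, `threadIdx`, and `blockDim` variables rather than from function arguments.

```python
def saxpy_kernel(block_idx, thread_idx, block_dim, n, a, x, y, out):
    """Body executed by every thread: one array element per thread."""
    i = block_idx * block_dim + thread_idx   # global element index
    if i < n:                                # guard the ragged last block
        out[i] = a * x[i] + y[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Sequential stand-in for a parallel launch over grid_dim x block_dim threads."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


n = 10
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n

block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n elements

launch(saxpy_kernel, grid_dim, block_dim, n, 2.0, x, y, out)
print(out)
```

The kernel contains no loop over the array at all. The launch supplies one thread per element, which is why flat array code maps onto this model more naturally than recursive code does.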

Also, CUDA provides access to cuDPP (Data Parallel Primitives) as well as a number of vector libraries that support the usual suspects, such as fast Fourier transforms (FFTs). Several third-party companies and projects provide similar libraries. For example, Tech-X’s GPULib provides hooks for Java, Python, Matlab, and IDL, allowing a wide range of applications to take advantage of NVidia’s GPU.

The application space is much wider than just graphics, though 3D visualization and analysis are often high on the list. One design in the medical industry from TechniScan performs analysis for the Whole Breast Ultrasound scanner. Four Tesla T10 boards can analyze a scan in 15 minutes, compared to a much more expensive, 16-core cluster that takes three times as long to handle the same job.

The CUDA C compiler is a free but not open-source download, though many of the projects in NVidia’s CUDA Zone are open-source. The interface specifications are open. CUDA can also generate code for conventional multicore platforms, though usually with lower performance benefits than a GPU can provide.

Developers can build applications using CUDA and run them on platforms such as NVidia’s GeForce 8 series. These are only single-precision platforms, and the new Tesla boards bring more memory and cores to bear, but both can run the same applications. The current drivers from NVidia for the company’s graphics boards will all support CUDA applications and development.

Several universities already use CUDA in parallel programming classes and projects. It should be interesting to see how parallel processing grows now that many developers can tap the power in their NVidia multicore graphics boards.

WILLIAM WONG

NVIDIA
www.nvidia.com
