CUDA C PROGRAMMING GUIDE
PG-02829-001_v7.5 | September 2015
Design Guide
CHANGES FROM VERSION 7.0
Updated C/C++ Language Support as follows:
  - Added a new section, C++11 Language Features.
  - Clarified that values of const-qualified variables with builtin floating-point types cannot be used directly in device code when the Microsoft compiler is used as the host compiler.
  - Documented the extended lambda feature.
  - Documented that typeid, std::type_info, and dynamic_cast are only supported in host code.
  - Documented the restrictions on trigraphs and digraphs.
  - Clarified the conditions under which layout mismatch can occur on Windows.
Updated Table 12 to mention support of half-precision floating-point operations on devices of compute capability 5.3.
Updated Table 2 with throughput for half-precision floating-point instructions.
Added compute capability 5.3 to Table 13.
Added the maximum number of resident grids per device to Table 13.
Clarified the definition of __threadfence() in Memory Fence Functions.
Mentioned in Atomic Functions that atomic functions do not act as memory fences.
www.nvidia.com
CUDA C Programming Guide
PG-02829-001_v7.5 | ii
TABLE OF CONTENTS
Chapter 1. Introduction .... 1
1.1. From Graphics Processing to General Purpose Parallel Computing .... 1
1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model .... 4
1.3. A Scalable Programming Model .... 5
1.4. Document Structure .... 7
Chapter 2. Programming Model .... 9
2.1. Kernels .... 9
2.2. Thread Hierarchy .... 10
2.3. Memory Hierarchy .... 12
2.4. Heterogeneous Programming .... 14
2.5. Compute Capability .... 16
Chapter 3. Programming Interface .... 17
3.1. Compilation with NVCC .... 17
3.1.1. Compilation Workflow .... 18
3.1.1.1. Offline Compilation .... 18
3.1.1.2. Just-in-Time Compilation .... 18
3.1.2. Binary Compatibility .... 18
3.1.3. PTX Compatibility .... 19
3.1.4. Application Compatibility .... 19
3.1.5. C/C++ Compatibility .... 20
3.1.6. 64-Bit Compatibility .... 20
3.2. CUDA C Runtime .... 20
3.2.1. Initialization .... 21
3.2.2. Device Memory .... 21
3.2.3. Shared Memory .... 24
3.2.4. Page-Locked Host Memory .... 29
3.2.4.1. Portable Memory .... 30
3.2.4.2. Write-Combining Memory .... 30
3.2.4.3. Mapped Memory .... 30
3.2.5. Asynchronous Concurrent Execution .... 31
3.2.5.1. Concurrent Execution between Host and Device .... 32
3.2.5.2. Concurrent Kernel Execution .... 32
3.2.5.3. Overlap of Data Transfer and Kernel Execution .... 32
3.2.5.4. Concurrent Data Transfers .... 33
3.2.5.5. Streams .... 33
3.2.5.6. Events .... 37
3.2.5.7. Synchronous Calls .... 37
3.2.6. Multi-Device System .... 38
3.2.6.1. Device Enumeration .... 38
3.2.6.2. Device Selection .... 38
3.2.6.3. Stream and Event Behavior .... 38
3.2.6.4. Peer-to-Peer Memory Access .... 39
3.2.6.5. Peer-to-Peer Memory Copy .... 39
3.2.7. Unified Virtual Address Space .... 40
3.2.8. Interprocess Communication .... 41
3.2.9. Error Checking .... 41
3.2.10. Call Stack .... 42
3.2.11. Texture and Surface Memory .... 42
3.2.11.1. Texture Memory .... 42
3.2.11.2. Surface Memory .... 52
3.2.11.3. CUDA Arrays .... 56
3.2.11.4. Read/Write Coherency .... 56
3.2.12. Graphics Interoperability .... 56
3.2.12.1. OpenGL Interoperability .... 57
3.2.12.2. Direct3D Interoperability .... 59
3.2.12.3. SLI Interoperability .... 65
3.3. Versioning and Compatibility .... 66
3.4. Compute Modes .... 67
3.5. Mode Switches .... 68
3.6. Tesla Compute Cluster Mode for Windows .... 68
Chapter 4. Hardware Implementation .... 69
4.1. SIMT Architecture .... 69
4.2. Hardware Multithreading .... 71
Chapter 5. Performance Guidelines .... 72
5.1. Overall Performance Optimization Strategies .... 72
5.2. Maximize Utilization .... 72
5.2.1. Application Level .... 72
5.2.2. Device Level .... 73
5.2.3. Multiprocessor Level .... 73
5.2.3.1. Occupancy Calculator .... 75
5.3. Maximize Memory Throughput .... 77
5.3.1. Data Transfer between Host and Device .... 78
5.3.2. Device Memory Accesses .... 79
5.4. Maximize Instruction Throughput .... 83
5.4.1. Arithmetic Instructions .... 83
5.4.2. Control Flow Instructions .... 87
5.4.3. Synchronization Instruction .... 88
Appendix A. CUDA-Enabled GPUs .... 89
Appendix B. C Language Extensions .... 90
B.1. Function Type Qualifiers .... 90
B.1.1. __device__ .... 90
B.1.2. __global__ .... 90
B.1.3. __host__ .... 90
B.1.4. __noinline__ and __forceinline__ .... 91
B.2. Variable Type Qualifiers .... 91
B.2.1. __device__ .... 91
B.2.2. __constant__ .... 92
B.2.3. __shared__ .... 92
B.2.4. __managed__ .... 93
B.2.5. __restrict__ .... 93
B.3. Built-in Vector Types .... 94
B.3.1. char, short, int, long, longlong, float, double .... 94
B.3.2. dim3 .... 95
B.4. Built-in Variables .... 96
B.4.1. gridDim .... 96
B.4.2. blockIdx .... 96
B.4.3. blockDim .... 96
B.4.4. threadIdx .... 96
B.4.5. warpSize .... 96
B.5. Memory Fence Functions .... 96
B.6. Synchronization Functions .... 99
B.7. Mathematical Functions .... 100
B.8. Texture Functions .... 100
B.8.1. Texture Object API .... 101
B.8.1.1. tex1Dfetch() .... 101
B.8.1.2. tex1D() .... 101
B.8.1.3. tex1DLod() .... 101
B.8.1.4. tex1DGrad() .... 101
B.8.1.5. tex2D() .... 101
B.8.1.6. tex2DLod() .... 101
B.8.1.7. tex2DGrad() .... 102
B.8.1.8. tex3D() .... 102
B.8.1.9. tex3DLod() .... 102
B.8.1.10. tex3DGrad() .... 102
B.8.1.11. tex1DLayered() .... 102
B.8.1.12. tex1DLayeredLod() .... 102
B.8.1.13. tex1DLayeredGrad() .... 103
B.8.1.14. tex2DLayered() .... 103
B.8.1.15. tex2DLayeredLod() .... 103
B.8.1.16. tex2DLayeredGrad() .... 103
B.8.1.17. texCubemap() .... 103
B.8.1.18. texCubemapLod() .... 103
B.8.1.19. texCubemapLayered() .... 104
B.8.1.20. texCubemapLayeredLod() .... 104
B.8.1.21. tex2Dgather() .... 104
B.8.2. Texture Reference API .... 105