CUDA C PROGRAMMING GUIDE
PG-02829-001_v7.5 | September 2015
Design Guide
CHANGES FROM VERSION 7.0
‣ Updated C/C++ Language Support to:
  ‣ Added new section C++11 Language Features,
  ‣ Clarified that values of const-qualified variables with builtin floating-point types cannot be used directly in device code when the Microsoft compiler is used as the host compiler,
  ‣ Documented the extended lambda feature,
  ‣ Documented that typeid, std::type_info, and dynamic_cast are only supported in host code,
  ‣ Documented the restrictions on trigraphs and digraphs,
  ‣ Clarified the conditions under which layout mismatch can occur on Windows.
‣ Updated Table 12 to mention support of half-precision floating-point operations on devices of compute capabilities 5.3.
‣ Updated Table 2 with throughput for half-precision floating-point instructions.
‣ Added compute capability 5.3 to Table 13.
‣ Added the maximum number of resident grids per device to Table 13.
‣ Clarified the definition of __threadfence() in Memory Fence Functions.
‣ Mentioned in Atomic Functions that atomic functions do not act as memory fences.
www.nvidia.com
CUDA C Programming Guide
PG-02829-001_v7.5 | ii
TABLE OF CONTENTS
Chapter 1. Introduction
1.1. From Graphics Processing to General Purpose Parallel Computing
1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model
1.3. A Scalable Programming Model
1.4. Document Structure
Chapter 2. Programming Model
2.1. Kernels
2.2. Thread Hierarchy
2.3. Memory Hierarchy
2.4. Heterogeneous Programming
2.5. Compute Capability
Chapter 3. Programming Interface
3.1. Compilation with NVCC
3.1.1. Compilation Workflow
3.1.1.1. Offline Compilation
3.1.1.2. Just-in-Time Compilation
3.1.2. Binary Compatibility
3.1.3. PTX Compatibility
3.1.4. Application Compatibility
3.1.5. C/C++ Compatibility
3.1.6. 64-Bit Compatibility
3.2. CUDA C Runtime
3.2.1. Initialization
3.2.2. Device Memory
3.2.3. Shared Memory
3.2.4. Page-Locked Host Memory
3.2.4.1. Portable Memory
3.2.4.2. Write-Combining Memory
3.2.4.3. Mapped Memory
3.2.5. Asynchronous Concurrent Execution
3.2.5.1. Concurrent Execution between Host and Device
3.2.5.2. Concurrent Kernel Execution
3.2.5.3. Overlap of Data Transfer and Kernel Execution
3.2.5.4. Concurrent Data Transfers
3.2.5.5. Streams
3.2.5.6. Events
3.2.5.7. Synchronous Calls
3.2.6. Multi-Device System
3.2.6.1. Device Enumeration
3.2.6.2. Device Selection
3.2.6.3. Stream and Event Behavior
3.2.6.4. Peer-to-Peer Memory Access
3.2.6.5. Peer-to-Peer Memory Copy
3.2.7. Unified Virtual Address Space
3.2.8. Interprocess Communication
3.2.9. Error Checking
3.2.10. Call Stack
3.2.11. Texture and Surface Memory
3.2.11.1. Texture Memory
3.2.11.2. Surface Memory
3.2.11.3. CUDA Arrays
3.2.11.4. Read/Write Coherency
3.2.12. Graphics Interoperability
3.2.12.1. OpenGL Interoperability
3.2.12.2. Direct3D Interoperability
3.2.12.3. SLI Interoperability
3.3. Versioning and Compatibility
3.4. Compute Modes
3.5. Mode Switches
3.6. Tesla Compute Cluster Mode for Windows
Chapter 4. Hardware Implementation
4.1. SIMT Architecture
4.2. Hardware Multithreading
Chapter 5. Performance Guidelines
5.1. Overall Performance Optimization Strategies
5.2. Maximize Utilization
5.2.1. Application Level
5.2.2. Device Level
5.2.3. Multiprocessor Level
5.2.3.1. Occupancy Calculator
5.3. Maximize Memory Throughput
5.3.1. Data Transfer between Host and Device
5.3.2. Device Memory Accesses
5.4. Maximize Instruction Throughput
5.4.1. Arithmetic Instructions
5.4.2. Control Flow Instructions
5.4.3. Synchronization Instruction
Appendix A. CUDA-Enabled GPUs
Appendix B. C Language Extensions
B.1. Function Type Qualifiers
B.1.1. __device__
B.1.2. __global__
B.1.3. __host__
B.1.4. __noinline__ and __forceinline__
B.2. Variable Type Qualifiers
B.2.1. __device__
B.2.2. __constant__
B.2.3. __shared__
B.2.4. __managed__
B.2.5. __restrict__
B.3. Built-in Vector Types
B.3.1. char, short, int, long, longlong, float, double
B.3.2. dim3
B.4. Built-in Variables
B.4.1. gridDim
B.4.2. blockIdx
B.4.3. blockDim
B.4.4. threadIdx
B.4.5. warpSize
B.5. Memory Fence Functions
B.6. Synchronization Functions
B.7. Mathematical Functions
B.8. Texture Functions
B.8.1. Texture Object API
B.8.1.1. tex1Dfetch()
B.8.1.2. tex1D()
B.8.1.3. tex1DLod()
B.8.1.4. tex1DGrad()
B.8.1.5. tex2D()
B.8.1.6. tex2DLod()
B.8.1.7. tex2DGrad()
B.8.1.8. tex3D()
B.8.1.9. tex3DLod()
B.8.1.10. tex3DGrad()
B.8.1.11. tex1DLayered()
B.8.1.12. tex1DLayeredLod()
B.8.1.13. tex1DLayeredGrad()
B.8.1.14. tex2DLayered()
B.8.1.15. tex2DLayeredLod()
B.8.1.16. tex2DLayeredGrad()
B.8.1.17. texCubemap()
B.8.1.18. texCubemapLod()
B.8.1.19. texCubemapLayered()
B.8.1.20. texCubemapLayeredLod()
B.8.1.21. tex2Dgather()
B.8.2. Texture Reference API