CUDA C PROGRAMMING GUIDE

PG-02829-001_v7.5 | September 2015

Design Guide

CHANGES FROM VERSION 7.0

‣ Updated C/C++ Language Support:
  ‣ Added new section C++11 Language Features,
  ‣ Clarified that values of const-qualified variables with builtin floating-point types cannot be used directly in device code when the Microsoft compiler is used as the host compiler,
  ‣ Documented the extended lambda feature,
  ‣ Documented that typeid, std::type_info, and dynamic_cast are only supported in host code,
  ‣ Documented the restrictions on trigraphs and digraphs,
  ‣ Clarified the conditions under which layout mismatch can occur on Windows.
‣ Updated Table 12 to mention support of half-precision floating-point operations on devices of compute capability 5.3.
‣ Updated Table 2 with throughput for half-precision floating-point instructions.
‣ Added compute capability 5.3 to Table 13.
‣ Added the maximum number of resident grids per device to Table 13.
‣ Clarified the definition of __threadfence() in Memory Fence Functions.
‣ Mentioned in Atomic Functions that atomic functions do not act as memory fences.

www.nvidia.com

CUDA C Programming Guide

PG-02829-001_v7.5 | ii

TABLE OF CONTENTS

Chapter 1. Introduction
1.1. From Graphics Processing to General Purpose Parallel Computing
1.2. CUDA: A General-Purpose Parallel Computing Platform and Programming Model
1.3. A Scalable Programming Model
1.4. Document Structure
Chapter 2. Programming Model
2.1. Kernels
2.2. Thread Hierarchy
2.3. Memory Hierarchy
2.4. Heterogeneous Programming
2.5. Compute Capability
Chapter 3. Programming Interface
3.1. Compilation with NVCC
3.1.1. Compilation Workflow
3.1.1.1. Offline Compilation
3.1.1.2. Just-in-Time Compilation
3.1.2. Binary Compatibility
3.1.3. PTX Compatibility
3.1.4. Application Compatibility
3.1.5. C/C++ Compatibility
3.1.6. 64-Bit Compatibility
3.2. CUDA C Runtime
3.2.1. Initialization
3.2.2. Device Memory
3.2.3. Shared Memory
3.2.4. Page-Locked Host Memory
3.2.4.1. Portable Memory
3.2.4.2. Write-Combining Memory
3.2.4.3. Mapped Memory
3.2.5. Asynchronous Concurrent Execution
3.2.5.1. Concurrent Execution between Host and Device
3.2.5.2. Concurrent Kernel Execution
3.2.5.3. Overlap of Data Transfer and Kernel Execution
3.2.5.4. Concurrent Data Transfers
3.2.5.5. Streams
3.2.5.6. Events
3.2.5.7. Synchronous Calls
3.2.6. Multi-Device System
3.2.6.1. Device Enumeration
3.2.6.2. Device Selection
3.2.6.3. Stream and Event Behavior
3.2.6.4. Peer-to-Peer Memory Access
3.2.6.5. Peer-to-Peer Memory Copy
3.2.7. Unified Virtual Address Space
3.2.8. Interprocess Communication
3.2.9. Error Checking
3.2.10. Call Stack
3.2.11. Texture and Surface Memory
3.2.11.1. Texture Memory
3.2.11.2. Surface Memory
3.2.11.3. CUDA Arrays
3.2.11.4. Read/Write Coherency
3.2.12. Graphics Interoperability
3.2.12.1. OpenGL Interoperability
3.2.12.2. Direct3D Interoperability
3.2.12.3. SLI Interoperability
3.3. Versioning and Compatibility
3.4. Compute Modes
3.5. Mode Switches
3.6. Tesla Compute Cluster Mode for Windows
Chapter 4. Hardware Implementation
4.1. SIMT Architecture
4.2. Hardware Multithreading
Chapter 5. Performance Guidelines
5.1. Overall Performance Optimization Strategies
5.2. Maximize Utilization
5.2.1. Application Level
5.2.2. Device Level
5.2.3. Multiprocessor Level
5.2.3.1. Occupancy Calculator
5.3. Maximize Memory Throughput
5.3.1. Data Transfer between Host and Device
5.3.2. Device Memory Accesses
5.4. Maximize Instruction Throughput
5.4.1. Arithmetic Instructions
5.4.2. Control Flow Instructions
5.4.3. Synchronization Instruction
Appendix A. CUDA-Enabled GPUs
Appendix B. C Language Extensions
B.1. Function Type Qualifiers
B.1.1. __device__
B.1.2. __global__
B.1.3. __host__
B.1.4. __noinline__ and __forceinline__
B.2. Variable Type Qualifiers
B.2.1. __device__
B.2.2. __constant__
B.2.3. __shared__
B.2.4. __managed__
B.2.5. __restrict__
B.3. Built-in Vector Types
B.3.1. char, short, int, long, longlong, float, double
B.3.2. dim3
B.4. Built-in Variables
B.4.1. gridDim
B.4.2. blockIdx
B.4.3. blockDim
B.4.4. threadIdx
B.4.5. warpSize
B.5. Memory Fence Functions
B.6. Synchronization Functions
B.7. Mathematical Functions
B.8. Texture Functions
B.8.1. Texture Object API
B.8.1.1. tex1Dfetch()
B.8.1.2. tex1D()
B.8.1.3. tex1DLod()
B.8.1.4. tex1DGrad()
B.8.1.5. tex2D()
B.8.1.6. tex2DLod()
B.8.1.7. tex2DGrad()
B.8.1.8. tex3D()
B.8.1.9. tex3DLod()
B.8.1.10. tex3DGrad()
B.8.1.11. tex1DLayered()
B.8.1.12. tex1DLayeredLod()
B.8.1.13. tex1DLayeredGrad()
B.8.1.14. tex2DLayered()
B.8.1.15. tex2DLayeredLod()
B.8.1.16. tex2DLayeredGrad()
B.8.1.17. texCubemap()
B.8.1.18. texCubemapLod()
B.8.1.19. texCubemapLayered()
B.8.1.20. texCubemapLayeredLod()
B.8.1.21. tex2Dgather()
B.8.2. Texture Reference API
