Thread Execution:
A warp of 32 threads physically runs together on an SM
All threads in a warp share one instruction stream (SIMT)
4 cycles to issue 1 warp instruction
Warps are dynamically scheduled by the SM
• Executed when operands are ready
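Because a warp shares one instruction stream, the practical consequence is that per-thread branches split a warp while per-warp branches do not. A minimal sketch (kernel and variable names are ours, not from the slides):

```cuda
// Sketch: branching on a per-warp value keeps the shared instruction
// stream of each 32-thread warp intact (no divergence).
__global__ void warpUniformBranch(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int warpId = threadIdx.x / warpSize;   // warpSize is 32

    // All 32 threads of a warp take the same side of this branch.
    if (warpId % 2 == 0)
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;
}
```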
Block IDs and Thread IDs
Why IDs with different dimensions?
----To simplify memory addressing when processing multidimensional data
Image processing
Solving PDEs on volumes
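For the image-processing case above, a 2D grid/block layout maps directly onto pixel coordinates, which is exactly the addressing simplification meant here. A hedged sketch (kernel name and operation are illustrative):

```cuda
// Sketch: 2D block and thread IDs become pixel coordinates directly.
__global__ void invertImage(unsigned char *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];
}
```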
Threads Hierarchy
thread
Thread block
Cooperative Thread Array (CTA)
Max 512 threads per block
Grid
Share data in global memory
Dynamically scheduled at runtime
Kernel
The code that each thread executes
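The whole hierarchy fits in a few lines: a kernel runs per thread, threads are grouped into blocks (CTAs), and blocks form a grid. An illustrative sketch, with names and sizes assumed:

```cuda
#include <cuda_runtime.h>

// Kernel: the code each thread runs.
__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] += 1.0f;
}

int main(void)
{
    const int n = 1024;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // Grid of 4 blocks, 256 threads each -- within the 512-thread limit.
    addOne<<<4, 256>>>(d, n);
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}
```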
CUDA Memory Model
Global memory
Contents visible to all threads
Shared memory
Shared by all threads in one block
Constant memory
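The three memory spaces listed above can be seen in one small kernel. A sketch with assumed names (the block is taken to have at most 256 threads; `scale` would be set from the host with cudaMemcpyToSymbol):

```cuda
__constant__ float scale;          // constant memory, visible to whole grid

__global__ void scaleArray(const float *in, float *out, int n)
{
    __shared__ float tile[256];    // shared memory, one copy per block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i]; // stage global memory into shared memory
        __syncthreads();           // make the tile visible block-wide
        out[i] = tile[threadIdx.x] * scale;  // apply the constant
    }
}
```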
CUDA extends C
Declaration specs
__global__, __device__, __shared__, __local__, __constant__
Runtime API
Memory, symbol, execution management
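All three runtime-API duties appear in a short host program. A hedged sketch (kernel and names are ours): cudaMalloc/cudaFree for memory, cudaMemcpyToSymbol for symbols, and the `<<<>>>` launch for execution:

```cuda
#include <cuda_runtime.h>

__constant__ int threshold;        // a symbol managed via the runtime API

__global__ void clampToThreshold(int *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && v[i] > threshold) v[i] = threshold;
}

int main(void)
{
    const int n = 256, t = 10;
    int *d;
    cudaMalloc(&d, n * sizeof(int));                   // memory management
    cudaMemset(d, 0, n * sizeof(int));
    cudaMemcpyToSymbol(threshold, &t, sizeof(int));    // symbol management
    clampToThreshold<<<1, n>>>(d, n);                  // execution management
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```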
Type & Scope
__device__ : global memory / grid scope
__device__ __constant__ : constant memory / grid scope
Automatic variables without any qualifier reside in registers
Except arrays, which reside in local memory
Variables:
Built-in Vector Types
int1, int2, int3, int4, float1, float2, float3, float4,...
Defined by constructor functions of the form make_<type>
• int4 make_int4 (int x, int y, int z, int w)
• int4 iv = make_int4(1, 2, 3, 4);
– iv.x = 1, iv.y = 2, iv.z = 3, iv.w = 4
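The constructors assign components in x, y, z, w order and are usable on both host and device. A hedged host-side sketch:

```cuda
#include <vector_types.h>      // int4, float4, ...
#include <vector_functions.h>  // make_int4, make_float4, ...

int main(void)
{
    int4 iv = make_int4(1, 2, 3, 4);
    // iv.x == 1, iv.y == 2, iv.z == 3, iv.w == 4

    float4 fv = make_float4(0.5f, 1.5f, 2.5f, 3.5f);
    return (iv.x == 1 && iv.w == 4 && fv.y == 1.5f) ? 0 : 1;
}
```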
Built-in dim3 Type
Built-in variables: gridDim, blockDim (dim3); blockIdx, threadIdx (uint3)
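On the host, dim3 is used to configure a launch; unspecified components default to 1. A sketch with assumed image dimensions, using ceil-division so the grid covers every element:

```cuda
#include <cuda_runtime.h>
#include <assert.h>

int main(void)
{
    dim3 block(16, 16);        // 256 threads per block; block.z defaults to 1
    assert(block.z == 1);

    // Ceil-divide so the grid covers a 1920x1080 image completely.
    dim3 grid((1920 + block.x - 1) / block.x,
              (1080 + block.y - 1) / block.y);
    assert(grid.x == 120 && grid.y == 68 && grid.z == 1);

    // A launch kernel<<<grid, block>>>(...) would expose these values on
    // the device as gridDim and blockDim.
    return 0;
}
```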