
CUDA Thread Block: transpose

Author: heresy (Live Space)  Editor: 覃里  2009-11-27 15:36  Source: IT168

Partitioning into Thread Blocks

Let's start with the CPU version of the program:

void computeGold( float* reference, float* idata,
                  const unsigned int size_x, const unsigned int size_y )
{
    // transpose matrix
    for( unsigned int y = 0; y < size_y; ++y )
    {
        for( unsigned int x = 0; x < size_x; ++x )
        {
            reference[(x * size_y) + y] = idata[(y * size_x) + x];
        }
    }
}

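In principle the program is very simple: it takes element (y * size_x) + x of the original array idata and stores it at position (x * size_y) + y of the new array reference. Viewed as a two-dimensional matrix, it moves the data at (x, y) to (y, x).

Drawn as a picture, it looks roughly like the figure on the right. The figure should also help explain the index calculation (y * size_x) + x, which uses a one-dimensional array in place of a two-dimensional matrix: it is really computing the position of the red cell, which amounts to counting the yellow cells.

Next, here is the GPU version written in CUDA: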
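To make the index mapping concrete, here is a small host-side sketch in C++. The helper name `transposeFlat` and the use of `std::vector` are assumptions made for illustration and are not part of the original sample. The loop body is the same arithmetic as in computeGold.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original sample): transposes a
// row-major matrix with size_y rows and size_x columns, stored in a
// flat vector, using the same index arithmetic as computeGold.
std::vector<float> transposeFlat(const std::vector<float>& idata,
                                 unsigned int size_x, unsigned int size_y)
{
    assert(idata.size() == static_cast<std::size_t>(size_x) * size_y);
    std::vector<float> reference(idata.size());
    for (unsigned int y = 0; y < size_y; ++y)
    {
        for (unsigned int x = 0; x < size_x; ++x)
        {
            // element (x, y) of the input lands at (y, x) of the output
            reference[(x * size_y) + y] = idata[(y * size_x) + x];
        }
    }
    return reference;
}
```

For a matrix with 2 rows and 3 columns stored as {1, 2, 3, 4, 5, 6} (size_x = 3, size_y = 2), the result is {1, 4, 2, 5, 3, 6}, which is the 3-row, 2-column transpose.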

__global__ void transpose_naive(float *odata, float *idata,
                                int width, int height)
{
    // global 2D index of this thread within the grid
    unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;

    // skip threads that fall outside the matrix
    if (xIndex < width && yIndex < height)
    {
        unsigned int index_in  = xIndex + width * yIndex;
        unsigned int index_out = yIndex + height * xIndex;
        odata[index_out] = idata[index_in];
    }
}

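The first two statements of the kernel function transpose_naive above compute the thread's own index, xIndex and yIndex. Because threads are organized into thread blocks, the block's index and size have to be taken into account as well.

The figure on the right shows a simple example. The grid (that is, the number of blocks) is 4*3 and the thread block size (blockDim) is 3*3. The thread at the red cell gets blockIdx (2,1) and threadIdx (0,1). So the coordinates of the red cell in the whole grid are given by the code above: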

xIndex = blockDim.x * blockIdx.x + threadIdx.x;
yIndex = blockDim.y * blockIdx.y + threadIdx.y;

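Plugging in the numbers gives (3 * 2 + 0, 3 * 1 + 1) = (6, 4).

This is the standard way in CUDA to compute a thread's index within the whole grid from the built-in variables blockDim, blockIdx and threadIdx. Once transpose_naive has the index it is responsible for, it performs the transpose in the usual way.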
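The same arithmetic can be checked on the host. The sketch below is plain C++, not device code. The struct `Index2D` and the function `globalIndex` are hypothetical stand-ins for CUDA's built-in variables, introduced only for this illustration.

```cpp
#include <cassert>

// Hypothetical host-side stand-in for a 2D CUDA built-in value
// (blockDim, blockIdx or threadIdx); not a CUDA type.
struct Index2D { unsigned int x, y; };

// Same arithmetic as the first two statements of transpose_naive.
Index2D globalIndex(Index2D blockDim, Index2D blockIdx, Index2D threadIdx)
{
    // a thread index always lies inside its block
    assert(threadIdx.x < blockDim.x && threadIdx.y < blockDim.y);
    return { blockDim.x * blockIdx.x + threadIdx.x,
             blockDim.y * blockIdx.y + threadIdx.y };
}
```

Calling `globalIndex({3, 3}, {2, 1}, {0, 1})` reproduces the red cell from the example: x = 6, y = 4.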

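However, when the data is divided among thread blocks, every block must have the same size, so the work may not divide evenly. For example, if the matrix is 23*203, the number of elements exceeds the GPU's limit on threads per block, yet the matrix cannot be cut into equal pieces. In this situation the usual approach is to allocate more threads than the original size requires. If the block size is set to 16*16, this produces 2*13 blocks, that is, 32 * 208 threads to process the matrix.

In that case, the threads that fall outside the original matrix are redundant and must not do any work, since they would read and write outside the arrays. That is why the kernel adds a check: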
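The block count in this example is an integer division rounded up. A minimal sketch follows; the function name `blocksNeeded` is an assumption for illustration and does not appear in the original sample.

```cpp
#include <cassert>

// Hypothetical helper: number of blocks needed to cover `size` elements
// along one axis when each block holds `blockSize` threads on that axis
// (integer division, rounded up).
unsigned int blocksNeeded(unsigned int size, unsigned int blockSize)
{
    assert(blockSize > 0);
    return (size + blockSize - 1) / blockSize;
}
```

With a block size of 16, `blocksNeeded(23, 16)` is 2 and `blocksNeeded(203, 16)` is 13, which gives the 2*13 blocks and 32 * 208 threads mentioned above. In CUDA host code, values like these would typically be used as the grid dimensions in the kernel launch configuration.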

if (xIndex < width && yIndex < height){ ...}

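In other words, the computation is carried out only if the computed index lies within the range of the data.

That is all there is to the basic version of the transpose device code. The next article will cover the version optimized with shared memory.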
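To see the effect of the bounds check without a GPU, the kernel can be emulated on the CPU with nested loops. This is only a sketch under stated assumptions: the function `emulateTransposeNaive` is hypothetical, square blocks are assumed, and the loops run sequentially, whereas the real threads run in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU emulation of a transpose_naive launch: the two outer
// loops play the role of blockIdx, the two inner loops of threadIdx.
// Returns how many emulated threads passed the bounds check.
std::size_t emulateTransposeNaive(std::vector<float>& odata,
                                  const std::vector<float>& idata,
                                  unsigned int width, unsigned int height,
                                  unsigned int block)
{
    assert(block > 0 && odata.size() == idata.size());
    assert(idata.size() == static_cast<std::size_t>(width) * height);
    const unsigned int gridX = (width + block - 1) / block;   // blocks in x
    const unsigned int gridY = (height + block - 1) / block;  // blocks in y
    std::size_t active = 0;
    for (unsigned int by = 0; by < gridY; ++by)
      for (unsigned int bx = 0; bx < gridX; ++bx)
        for (unsigned int ty = 0; ty < block; ++ty)
          for (unsigned int tx = 0; tx < block; ++tx)
          {
              unsigned int xIndex = block * bx + tx;
              unsigned int yIndex = block * by + ty;
              // same guard as in the kernel: ignore out-of-range threads
              if (xIndex < width && yIndex < height)
              {
                  odata[yIndex + height * xIndex] = idata[xIndex + width * yIndex];
                  ++active;
              }
          }
    return active;
}
```

For the 23*203 matrix with 16*16 blocks, 32 * 208 = 6656 threads are emulated, but only 23 * 203 = 4669 of them pass the check and touch memory.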
