CUDA Shared Memory：transpose-技术开发专区

CUDA Shared Memory：transpose

作者：heresy live空间编辑：覃里 2009-11-27 15:54 来源：IT168�

　　【IT168 文档】延續上一篇講 thread block，這一篇能然是繼續講 transpose 這個範例程式;不過這一篇，則是將焦點放在 shared memory 的使用，也就是 transpose 這個非常好的化過的 kernel 函式。

　　首先先來看一下 shared memory 的特色。在《nVidia CUDA 簡介》中，有大概提過各種不同記憶的差異了～而 shared memory 基本上是在晶片上的，只能由同一個 thread block 裡的 thread 來進行存取;而他的優點在於他的速度比 global memory 或 local memory 都來的快!實際上，在最好的情況下(避免掉 bank conflict)，應該是可以和 register 一樣快的～(關於 shared memory 的效率問題，詳細內容可參考《CUDA Programming Guide 1.1》的 5.1.2.4)

　　而在《CUDA Programming Guide 1.1》的《5.1.2 Memory Bandwidth》中，也提供了標準的 shared memory 使用的 pattern：

將資料由 device memory 轉存到 shared memory
讓 block 中的 thread 同步，如此才能確保每個 thread 能安全的讀到其他 thread 寫到 shared memory 的資料
處理 shared memory 中的資料
如果需要，再同步一次;讓 shared memory 裡的資料完成更新
將結果寫回到 device memory

　　而在 transpose 這個例子裡，他透過 shared memory 來做非常好的化，其實主要應該不是他的速度，而是要透過使用 shared memory 來避免對於傳進的 global memory 有 non-coalesced 的存取，而同時也避免使用 shared memory 時有 bank conflict 而拖慢速度。

　　而其非常好的化後的程式碼如下：

// This kernel is optimized to ensure all global reads and writes are coalesced,
// and to avoid bank conflicts in shared memory.  This kernel is up to 11x faster
// than the naive kernel below.  Note that the shared memory array is sized to
// (BLOCK_DIM+1)*BLOCK_DIM.  This pads each row of the 2D block in shared memory
// so that bank conflicts do not occur when threads address the array column-wise.
__global__ void transpose(float *odata, float *idata, int width, int height)
{
    __shared__ float block[BLOCK_DIM][BLOCK_DIM+1];
    // read the matrix tile into shared memory
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    if((xIndex < width) && (yIndex < height))
    {
        unsigned int index_in = yIndex * width + xIndex;
        block[threadIdx.y][threadIdx.x] = idata[index_in];
    }

    __syncthreads();

    // write the transposed matrix tile to global memory
    xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
    yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
    if((xIndex < height) && (yIndex < width))
    {
        unsigned int index_out = yIndex * height + xIndex;
        odata[index_out] = block[threadIdx.x][threadIdx.y];
    }
}

　　在上面的程式中可以看出來，第一行的

__shared__ float block[BLOCK_DIM][BLOCK_DIM+1];

　　就是透過 __shared__ 這個變數的 qualifier 來宣告出一個大小是 BLOCK_DIM * (BLOCK_DIM+1) 的二維陣列。而為什麼這樣宣告呢?主要就是因為 shared memory 是在同一個 block 裡的 thread 共用的，而在這個範例裡，每個 block 裡都有 BLOCK_DIM * BLOCK_DIM 個 thread 要來進行計算;因此，他宣告這個二維陣列的目的，就是在 shared memory 裡產生一塊記憶體空間(block)，拿來存放每一個 thread 要從 global memory 中讀取的資料(idata)。但是為什麼會變成 BLOCK_DIM * (BLOCK_DIM+1) 呢?在函式的註解釋說這是為了避免在對陣列做 column-wise 存取時產生 bank conflict 而影響效率，這點 Heresy 也還沒研究完，所以就先跳過了。

　　而接下來的動作，就是把資料從 global memory 裡複製到 shared memory 了～而這邊做的事，除了本來該做的條件判斷外，再來就是計算資料在 idata 中對應的位置，並把他複製到 block 這個 shared memory 的變數裡了。

unsigned int index_in = yIndex * width + xIndex;block[threadIdx.y][threadIdx.x] = idata[index_in];

　　而由於這些步驟都是每個 thread 各自做，然後各自填到 block 中，所以在結束後，必須要呼叫 __syncthreads() 這個函式來對 block 中的 thread 同步，確定 block 中的所有 thread 都已經完成 block[threadIdx.y][threadIdx.x] 的寫入動作。在 __syncthreads() 執行完成後，這個 thread block 應該已經從 idata 這個 array 中取出 BLOCK_DIM * BLOCK_DIM 大小的區域，放置到 shared memory 了～而由於每個 thread block 都會做同樣的動作，所以就相當於整個 array 被切割成數個大小是 BLOCK_DIM * BLOCK_DIM 的部分來處理了。

　　而再接下來的程式，就是透過已經從 idata 複製出來的局部的矩陣資料 block 中，把值再填入要輸出的變數 odata 中。而由於他是直接把值寫到屬於 global memory 的變數，所以其實就不需要呼叫 __syncthreads() 來對 block 中的 thread 做同步了。

　　而到這邊為止，大概就是 CUDA 中 Shared memory 基本的使用方法了～其他的東西，以後有空再說吧～ ^^"

关注我们