Thrust Quick Start Guide: Algorithms

Author: Le Yuan (乐渊), compilation | Editor: Wang Yuyuan (王玉圆) | 2011-07-15 00:05 | IT168 original

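[IT168 feature] Thrust is an STL-like C++ template library for CUDA that makes programs more concise and readable. Its interface is fully interoperable with CUDA C, so high-performance parallel programs can be written productively. With Thrust, programmers can build CUDA programs quickly while keeping robustness, performance, and accuracy; routines such as parallel sort are reported to run 5 to 100 times faster.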

In a previous article we covered the basics of Thrust (related reading: Thrust Quick Start Guide: Basics). This article covers Thrust's algorithms.

The translation below comes from the blog of 忆幽梦; see: http://blog.cudachina.org/dreampursue/

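Thrust provides a large number of common parallel algorithms. Many of them closely resemble STL algorithms, and where an equivalent exists Thrust uses the same name (for example, thrust::sort and std::sort).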

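All Thrust algorithms have implementations for both the host and the device. Specifically, when a Thrust algorithm is called with host iterators, the host path is dispatched; likewise, when it is called with device iterators, the device implementation is used.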

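thrust::copy is the exception: it can copy data between the host and the device. For every other Thrust algorithm, all iterator arguments must live in the same place, either all on the host or all on the device. When this requirement is violated, the compiler reports an error.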

Transformations

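Transformations are algorithms that apply an operation to each element of a set of (zero or more) input ranges and store the result in a destination range. One example we have already used is thrust::fill, which sets every element of a range to a given value. Other transformations include thrust::sequence, thrust::replace, and thrust::transform. See the documentation for the complete list.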

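The following code demonstrates several of the transformation algorithms. Note that thrust::negate and thrust::modulus are what C++ calls functors (function objects). Thrust provides these, along with other common functors such as plus and multiplies, in thrust/functional.h.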

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include <thrust/fill.h>
#include <thrust/replace.h>
#include <thrust/functional.h>
#include <iostream>
#include <iterator>  // std::ostream_iterator

int main(void)
{
    // allocate three device_vectors with 10 elements
    thrust::device_vector<int> X(10);
    thrust::device_vector<int> Y(10);
    thrust::device_vector<int> Z(10);

    // initialize X to 0,1,2,3, ....
    thrust::sequence(X.begin(), X.end());

    // compute Y = -X
    thrust::transform(X.begin(), X.end(), Y.begin(), thrust::negate<int>());

    // fill Z with twos
    thrust::fill(Z.begin(), Z.end(), 2);

    // compute Y = X mod 2
    thrust::transform(X.begin(), X.end(), Z.begin(), Y.begin(), thrust::modulus<int>());

    // replace all the ones in Y with tens
    thrust::replace(Y.begin(), Y.end(), 1, 10);

    // print Y, one element per line
    thrust::copy(Y.begin(), Y.end(), std::ostream_iterator<int>(std::cout, "\n"));

    return 0;
}

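The functors in thrust/functional.h cover most of the built-in arithmetic and comparison operations, but we often want to do something different. For example, consider the vector operation y <- a * x + y, where x and y are vectors and a is a scalar constant. This is the well-known SAXPY operation provided by BLAS libraries.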

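To implement SAXPY with Thrust we have a few options. The first is to use two transformations (one multiplication and one addition) together with a temporary vector filled with the value a. A better option is a single transformation with a user-defined functor that does exactly what we want. The source code below illustrates both approaches.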

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/fill.h>
#include <thrust/functional.h>

// user-defined functor computing a * x + y for one pair of elements
struct saxpy_functor
{
    const float a;

    saxpy_functor(float _a) : a(_a) {}

    __host__ __device__
    float operator()(const float& x, const float& y) const {
        return a * x + y;
    }
};

// single fused transformation
void saxpy_fast(float A, thrust::device_vector<float>& X, thrust::device_vector<float>& Y)
{
    // Y <- A * X + Y
    thrust::transform(X.begin(), X.end(), Y.begin(), Y.begin(), saxpy_functor(A));
}

// two transformations plus a temporary vector
void saxpy_slow(float A, thrust::device_vector<float>& X, thrust::device_vector<float>& Y)
{
    thrust::device_vector<float> temp(X.size());

    // temp <- A
    thrust::fill(temp.begin(), temp.end(), A);

    // temp <- A * X
    thrust::transform(X.begin(), X.end(), temp.begin(), temp.begin(), thrust::multiplies<float>());

    // Y <- A * X + Y
    thrust::transform(temp.begin(), temp.end(), Y.begin(), Y.begin(), thrust::plus<float>());
}

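Both saxpy_fast and saxpy_slow are valid SAXPY implementations, but saxpy_fast is considerably faster than saxpy_slow. Ignoring the cost of allocating the temporary vector and of the arithmetic itself, the costs are:

saxpy_fast: 2N reads and N writes

saxpy_slow: 4N reads and 3N writes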

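Because SAXPY is memory bound (its performance is limited by memory bandwidth, not by floating-point throughput), the larger number of reads and writes makes saxpy_slow much more expensive. saxpy_fast, in contrast, runs about as fast as SAXPY in an optimized BLAS implementation. For memory-bound algorithms such as SAXPY it is generally worth applying kernel fusion (combining multiple operations into a single kernel) to minimize the number of memory transactions.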
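
The read and write counts above can be checked by tallying each pass over the data. The byte totals below assume 4-byte float elements and ignore caching, so they are a rough model rather than a measurement.

```latex
\begin{aligned}
\texttt{saxpy\_fast}:\quad
  & \underbrace{2N}_{\text{read } X,\ Y} + \underbrace{N}_{\text{write } Y}
    = 3N \text{ accesses} = 12N \text{ bytes} \\[4pt]
\texttt{saxpy\_slow}:\quad
  & \underbrace{0 + N}_{\text{fill temp}}
    + \underbrace{2N + N}_{\text{multiplies: read } X,\ temp;\ \text{write } temp}
    + \underbrace{2N + N}_{\text{plus: read } temp,\ Y;\ \text{write } Y} \\
  & = 4N \text{ reads} + 3N \text{ writes}
    = 7N \text{ accesses} = 28N \text{ bytes}
\end{aligned}
```

Under these assumptions saxpy_slow moves 7/3, or roughly 2.3 times, as much data as saxpy_fast, which is why fusing the two passes pays off for a memory-bound operation.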

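thrust::transform only supports transformations with one or two input arguments (for example, f(x) -> y and f(x, y) -> z). When a transformation takes more than two input arguments, a different approach is needed. The arbitrary_transformation example shows a solution that uses thrust::zip_iterator and thrust::for_each.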
