7

I'm looking for a high-performance multiscan / multi prefix-sum function (many rows in one kernel execution) for my project in CUDA.

I've tried the one from the Thrust library but it's way too slow. Thrust also crashes after being compiled with the nvcc debug flags (-g -G).

After my failure with Thrust I focused on the cuDPP library, which used to be a part of the CUDA toolkit. cuDPP's performance is really good, but the library is not up to date with the latest CUDA 5.5, and the memory checker reports global memory violations in the cudppMultiScan() function while debugging. (CUDA 5.5, Nsight 3.1, Visual Studio 2010, GTX 260 cc 1.3)

Does anybody have any idea what to use instead of these two libraries?

R.

user1946472
  • 91
  • 2
  • 3
  • Have you looked at [ArrayFire](http://accelereyes.com/arrayfire), which we work on at AccelerEyes? – arrayfire Sep 01 '13 at 19:10
  • no, haven't seen this before, looks pretty interesting! thanks! :) what about its performance? Is it more productivity or performance oriented library? – user1946472 Sep 01 '13 at 22:13
  • If you want to use Thrust to scan the rows of a matrix, don't call `inclusive_scan` repeatedly. Assign each row an index and use `inclusive_scan_by_key`. You can adapt this [example](https://github.com/thrust/thrust/blob/master/examples/sum_rows.cu). – Jared Hoberock Sep 02 '13 at 01:14
  • @user1946472 For a single vector it is either better (at thousands of elements) or equal to thrust (at million of elements). For multiple matrices, arrayfire launches a single kernel and hence is faster than launching thrust multiple times. Source: I wrote the code. You can contact me (email on my profile) for more information. – Pavan Yalamanchili Sep 03 '13 at 13:17
  • @JaredHoberock Having to read an extra vector for a memory bound algorithm is not ideal. However it is better than launching the kernels multiple times. – Pavan Yalamanchili Sep 03 '13 at 13:19
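To make the `inclusive_scan_by_key` suggestion concrete: it performs a segmented scan in which the running sum restarts whenever the key (here, the row index) changes, so every row is scanned independently in a single pass. Below is a minimal CPU sketch of those semantics (illustrative only; the real `thrust::inclusive_scan_by_key` runs the same logic in parallel on the GPU):

```cpp
#include <cstddef>
#include <vector>

// CPU reference of segmented inclusive-scan semantics: the running sum
// restarts whenever the key changes, so each row (identified by its key)
// is scanned independently in one pass over the flattened data.
std::vector<float> inclusive_scan_by_key(const std::vector<int>& keys,
                                         const std::vector<float>& vals)
{
    std::vector<float> out(vals.size());
    for (std::size_t i = 0; i < vals.size(); ++i) {
        if (i > 0 && keys[i] == keys[i - 1])
            out[i] = out[i - 1] + vals[i];   // continue the current segment
        else
            out[i] = vals[i];                // key changed: new segment starts
    }
    return out;
}
```

For a matrix stored row-major, the key vector is simply each element's row index, e.g. keys `{0,0,0,1,1}` with values `{1,2,3,4,5}` yield `{1,3,6,4,9}`.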

2 Answers

2

These libraries, especially Thrust, try to be as generic as possible, and optimization often requires specialization: for example, a specialized version of an algorithm can use shared memory for fundamental types (like int or float) while the generic version can't. Sometimes the specialization you need is simply missing.

It's a good idea to use these well-tested generic libraries as much as possible, but sometimes, for performance-critical sections, your own implementation is an option worth considering.

In your situation you want many scans to run in parallel, one per row. A good implementation would not run a separate scan for each row: it would have a single kernel launch process all elements of all rows simultaneously. From its index, a thread can determine which row it's processing and ignore all data outside that row.

Such a specialization would require a functor that returns an absorbing value to prevent rows from mixing. Still, your own careful implementation would likely be much faster.
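The row-clamping idea above can be sketched serially. The loop below simulates a Hillis-Steele style scan over a flattened `rows * width` array: each "thread" i derives its row from its index (`i / width`) and refuses to add a partner value that lies in a different row. This is a CPU reference of the access pattern only; in a real kernel the inner loop body would be one thread, with the usual synchronization between strides.

```cpp
#include <vector>

// Serial simulation of a single-launch multi-row inclusive scan.
// Each position i belongs to row i / width; at every stride a position
// adds the value `stride` slots behind it, but only when that partner
// belongs to the same row -- this is the "clamp to one row" idea.
std::vector<float> multi_row_scan(std::vector<float> data, int width)
{
    for (int stride = 1; stride < width; stride *= 2) {
        std::vector<float> next = data;          // double-buffer, like ping-pong in shared memory
        for (int i = 0; i < static_cast<int>(data.size()); ++i) {
            int partner = i - stride;
            if (partner >= 0 && partner / width == i / width)
                next[i] = data[i] + data[partner];
        }
        data.swap(next);
    }
    return data;
}
```

With two rows of width 4, `{1,2,3,4, 1,1,1,1}` scans to `{1,3,6,10, 1,2,3,4}`: each row is scanned independently even though a single pass covers the whole array.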

a.lasram
  • 4,371
  • 1
  • 16
  • 24
2

To write your own prefix scan, you may refer to

  1. The scan example of the CUDA SDK;
  2. Chapter 13 of N. Wilt, "The CUDA Handbook";
  3. Chapter 6 of S. Cook, "CUDA Programming, A Developer's Guide to Parallel Computing with GPUs";
  4. Parallel Prefix Sum (Scan) with CUDA.

To do a multi prefix-sum you can launch the same kernel many times (as suggested by a.lasram) or try to achieve concurrency with CUDA streams, although I do not know if this will work effectively on your card.

Vitality
  • 20,705
  • 4
  • 108
  • 146
  • Using streams is an excellent idea but I think it's even better to launch one single kernel where each thread would "clamp" the computation to one selected row – a.lasram Sep 01 '13 at 21:51
  • I have 231 rows of 1424 floats, so executing each row in a separate kernel gives too big a time overhead from the kernel launches. cuDPP does this job in about 0.11 ms on my machine (GTX 260), which for me is an excellent result! In terms of performance the cuDPP lib is perfect. For now I'll try the ArrayFire library suggested by @accelereyes. Thank you for your answer. – user1946472 Sep 01 '13 at 21:54