I'm looking for a high-performance multiscan (multi prefix-sum) function for my CUDA project, i.e. one that scans many rows in a single kernel launch.
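To make clear what I mean by multiscan, here is a minimal CPU reference sketch of the result I'm after. It assumes a row-major matrix in one flat array and an inclusive sum (an exclusive scan would work the same way). The name multiscan_reference is just something I made up for illustration; it is not a Thrust or CUDPP function, and this is not a GPU implementation.

```cpp
#include <cstddef>
#include <vector>

// CPU reference for a "multiscan": an independent inclusive prefix sum over
// each row of a row-major matrix stored in one flat array.
// multiscan_reference is a hypothetical name, not a Thrust or CUDPP function.
std::vector<int> multiscan_reference(const std::vector<int>& in,
                                     std::size_t num_rows,
                                     std::size_t row_len)
{
    std::vector<int> out(in.size());
    for (std::size_t r = 0; r < num_rows; ++r) {
        int running = 0;  // reset at the start of every row
        for (std::size_t c = 0; c < row_len; ++c) {
            running += in[r * row_len + c];
            out[r * row_len + c] = running;
        }
    }
    return out;
}
```

For example, two rows of four elements {1,2,3,4, 5,6,7,8} should give {1,3,6,10, 5,11,18,26}. I need the GPU equivalent of this for all rows at once.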
I've tried the one from the Thrust library, but it is far too slow. Thrust also crashes when compiled with the nvcc debug flags (-g -G).
After my failure with Thrust I turned to the CUDPP library, which used to be part of the CUDA toolkit. CUDPP's performance is really good, but the library is not up to date with the latest CUDA 5.5, and the memory checker reports global memory access violations in the cudppMultiScan() function while debugging. (CUDA 5.5, Nsight 3.1, Visual Studio 2010, GTX 260, compute capability 1.3)
Does anybody know what I could use instead of these two libraries?
R.