I am interested in porting some existing code to use thrust to see if I can speed it up on the GPU with relative ease.
What I'm looking to accomplish is a stream compaction operation, where only nonzero elements will be kept. I have this mostly working, per the example code below. The part that I am unsure of how to tackle is dealing with all the extra fill space that is in d_res and thus h_res, after the compaction happens.
The example just uses a 0-99 sequence with all the even entries set to zero. This is just an example, and the real problem will be a general sparse array.
This answer here helped me greatly, although when it comes to reading out the data, the size is just known to be constant: How to quickly compact a sparse array with CUDA C?
I suspect that I can work around this by counting the number of 0's in d_src, and then only allocating d_res to be that size, or doing the count after the compaction, and only copying that many element. Is that really the right way to do it?
I get the sense that there will be some easy fix for this, via clever use of iterators or some other feature of thrust.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
//Predicate functor
struct is_not_zero
{
__host__ __device__
bool operator()(const int x)
{
return (x != 0);
}
};
using namespace std;
int main(void)
{
size_t N = 100;
//Host Vector
thrust::host_vector<int> h_src(N);
//Fill with some zero and some nonzero data, as an example
for (int i = 0; i < N; i++){
if (i % 2 == 0){
h_src[i] = 0;
}
else{
h_src[i] = i;
}
}
//Print out source data
cout << "Source:" << endl;
for (int i = 0; i < N; i++){
cout << h_src[i] << " ";
}
cout << endl;
//copies to device
thrust::device_vector<int> d_src = h_src;
//Result vector
thrust::device_vector<int> d_res(d_src.size());
//Copy non-zero elements from d_src to d_res
thrust::copy_if(d_src.begin(), d_src.end(), d_res.begin(), is_not_zero());
//Copy back to host
thrust::host_vector<int> h_res(d_res.begin(), d_res.end());
//thrust::host_vector<int> h_res = d_res; //Or just this?
//Show results
cout << "h_res size is " << h_res.size() << endl;
cout << "Result after remove:" << endl;
for (int i = 0; i < h_res.size(); i++){
cout << h_res[i] << " ";
}
cout << endl;
return 0;
}
Also, I am a novice with thrust, so if the above code has any obvious flaws that go against recommended practices for using thrust, please let me know.
Similarly, speed is always of interest. Reading some of the various thrust tutorials, it seems like little changes here and there can be big speed savers or wasters. So, please let me know if there is a smart way to speed this up.