I am having trouble arranging the threads to match my 2D data array.

It is a compact array in which every integer holds 32 bit values, e.g. [1000110001000000010000000000010], representing transactions, and I need to count the set bits row-wise (I have used integers instead of a bit vector/bitset). The array has dimensions 1000*3125, so every row contains 100,000 (1 lakh) bit values (3125 integers of 32 bits each).

I need to count the total number of bits set to 1 in each row, i.e. across the 3125 columns of each row. How should I arrange the threads/loops for optimum performance?

Gwenc37
    The answer to the post [Efficient method to check for matrix stability in CUDA](http://stackoverflow.com/questions/13443968/efficient-method-to-check-for-matrix-stability-in-cuda/23941330#23941330), which shows how to perform a bitwise reduction using vote intrinsics, could be helpful to you. – Vitality Jun 05 '14 at 12:22

1 Answer

You can use a standard parallel reduction approach. You would do one parallel reduction per row of your matrix. The only difference is that each thread will need to pick up a 32-bit value and compute the number of set bits first.
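As a rough sketch of the one-reduction-per-row idea, the kernel below assigns one thread block to each row: every thread counts the set bits in a strided slice of the row, and the block then combines the partial counts in shared memory. The names `countBitsPerRow`, `countBitsHost` and `kThreads`, the block size of 256, and the row-major layout are assumptions made for illustration, and CUDA error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int kThreads = 256;  // assumed block size; must be a power of two here

// One block per row. Each thread sums __popc() over a strided slice of the
// row, then the block folds the partial sums together in shared memory.
__global__ void countBitsPerRow(const unsigned int* data, int cols, int* rowCounts)
{
    __shared__ int partial[kThreads];
    const unsigned int* row = data + static_cast<size_t>(blockIdx.x) * cols;

    int sum = 0;
    for (int c = threadIdx.x; c < cols; c += kThreads)
        sum += __popc(row[c]);          // set bits in one 32-bit word
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Standard tree reduction: halve the number of active threads each step.
    for (int s = kThreads / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        rowCounts[blockIdx.x] = partial[0];
}

// Hypothetical host wrapper: copies a row-major matrix to the device,
// launches one block per row, and returns the per-row counts.
std::vector<int> countBitsHost(const std::vector<unsigned int>& h, int rows, int cols)
{
    assert(h.size() == static_cast<size_t>(rows) * cols);
    unsigned int* dData = nullptr;
    int* dCounts = nullptr;
    cudaMalloc(&dData, h.size() * sizeof(unsigned int));
    cudaMalloc(&dCounts, rows * sizeof(int));
    cudaMemcpy(dData, h.data(), h.size() * sizeof(unsigned int), cudaMemcpyHostToDevice);

    countBitsPerRow<<<rows, kThreads>>>(dData, cols, dCounts);

    std::vector<int> out(rows);
    cudaMemcpy(out.data(), dCounts, rows * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dCounts);
    return out;
}
```

For the matrix in the question this would be called as `countBitsHost(data, 1000, 3125)`. Consecutive threads read consecutive words of the row, so the global loads are coalesced.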

Counting the set bits is easy using the `__popc()` intrinsic, which returns the number of bits set in a 32-bit unsigned integer argument.
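To see the intrinsic on its own, here is a minimal sketch in which each thread replaces one 32-bit word with its bit count. The kernel name `popcWords`, the wrapper `popcHost`, and the launch configuration are made up for this example, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Each thread writes the number of set bits of one input word.
__global__ void popcWords(const unsigned int* in, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __popc(in[i]);
}

// Hypothetical host wrapper around the kernel above.
std::vector<int> popcHost(const std::vector<unsigned int>& words)
{
    assert(!words.empty());
    const int n = static_cast<int>(words.size());
    unsigned int* dIn = nullptr;
    int* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(unsigned int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMemcpy(dIn, words.data(), n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    const int threads = 128;                        // assumed block size
    const int blocks = (n + threads - 1) / threads; // enough blocks to cover n
    popcWords<<<blocks, threads>>>(dIn, dOut, n);

    std::vector<int> out(n);
    cudaMemcpy(out.data(), dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

For instance, the word `0xF0F0` has eight bits set, so `popcHost({0xF0F0u})` should return a single element equal to 8.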

For the parallel reduction part, if you're looking for the fastest possible performance, use CUB instead of writing your own.
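One possible way to bring CUB in, shown here only as a sketch, is to keep one block per row and let `cub::BlockReduce` do the block-wide sum. The kernel and wrapper names (`rowPopcountCub`, `rowPopcountCubHost`), the block size of 256, and the row-major layout are assumptions; whether this is the best-performing arrangement for a given GPU would need to be measured.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int kThreads = 256;  // assumed block size, also a BlockReduce template parameter

// One block per row: per-thread popcount accumulation, then a CUB block-wide sum.
__global__ void rowPopcountCub(const unsigned int* data, int cols, int* rowCounts)
{
    using BlockReduce = cub::BlockReduce<int, kThreads>;
    __shared__ typename BlockReduce::TempStorage tempStorage;

    const unsigned int* row = data + static_cast<size_t>(blockIdx.x) * cols;
    int partial = 0;
    for (int c = threadIdx.x; c < cols; c += kThreads)
        partial += __popc(row[c]);

    // The reduced value is only valid in thread 0 of the block.
    int total = BlockReduce(tempStorage).Sum(partial);
    if (threadIdx.x == 0)
        rowCounts[blockIdx.x] = total;
}

// Hypothetical host wrapper; error checking omitted for brevity.
std::vector<int> rowPopcountCubHost(const std::vector<unsigned int>& h, int rows, int cols)
{
    assert(h.size() == static_cast<size_t>(rows) * cols);
    unsigned int* dData = nullptr;
    int* dCounts = nullptr;
    cudaMalloc(&dData, h.size() * sizeof(unsigned int));
    cudaMalloc(&dCounts, rows * sizeof(int));
    cudaMemcpy(dData, h.data(), h.size() * sizeof(unsigned int), cudaMemcpyHostToDevice);

    rowPopcountCub<<<rows, kThreads>>>(dData, cols, dCounts);

    std::vector<int> out(rows);
    cudaMemcpy(out.data(), dCounts, rows * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dCounts);
    return out;
}
```

The launch must use exactly `kThreads` threads per block, because the block size is baked into the `cub::BlockReduce` type.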

Robert Crovella