Emulating std::bitset in CUDA

Question

I have an input array that is given to a kernel. Each thread works with one value of the array and either changes the value or doesn't change it at all according to a rule.

I would like to find out very quickly afterwards if there was any change inside the input memory and, in case there was, I would want to find very quickly where this change occurred (index of the input array).

I thought of using something like an array of bits. The total amount of bits would be equal to the total amount of threads. Each thread would manipulate only one bit, so initially the bits would be set to false, if a thread changes the corresponding input value the bit will become true.

In order to make it more clear, let's suppose we have this input array called A

1 9 3 9 4 5

The array of bits would be the following

0 0 0 0 0 0

So we would have 6 threads working on the input array. Let's suppose that the final input array will be

1 9 3 9 2 5

So the final array of bits would be:

0 0 0 0 1 0

I don't want to use an array of bool because each of the values will take 1 byte of memory which is quite a lot since I want to work only using bits.

Is it possible to achieve something like this?

I thought of creating a char array where each value of the array will have 8 bits. However, what if two threads would like to change different bits of the first character of the array? They would have to do the operation atomically even though the change inside the bit will be on different locations. So using atomic operations will probably disrupt the parallelism, and in this case using atomic operations is not needed, it doesn't make any sense, but would have to be used because of the constraints of using an array of chars instead of something more specialized like an std::bitset

Thank you in advance.

This question looks a lot like yours: http://stackoverflow.com/questions/11042816/how-to-create-large-bit-array-in-cuda — BenC, May 19 '13 at 13:13
Thanks I read the question and answer however it doesn't say anything about how I can use array of bits or if there's anything like an `std::bitset` in CUDA. Using an array of `bool` is not a good idea for me because I can't use too much memory in the GPU. — ksm001, May 19 '13 at 14:30

score 3 · Accepted Answer · answered May 26 '14 at 20:48

I'm providing a late answer to this question to remove it from the unanswered list.

To do what you want to achieve you can define an array of unsigned ints of length N/32, where N is the length of the arrays you are comparing. Then you can use atomicAdd to write each bit of such an array, depending on whether two elements of the arrays are equal or not.

Below I'm providing a simple example:

#include <iostream>

#include <thrust\device_vector.h>

__device__ unsigned int __ballot_non_atom(int predicate)
{
    if (predicate != 0) return (1 << (threadIdx.x % 32));
    else return 0;
}

__global__ void check_if_equal_elements(float* d_vec1_ptr, float* d_vec2_ptr, unsigned int* d_result, int Num_Warps_per_Block)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    const unsigned int warp_num = threadIdx.x >> 5;

    atomicAdd(&d_result[warp_num+blockIdx.x*Num_Warps_per_Block],__ballot_non_atom(!(d_vec1_ptr[tid] == d_vec2_ptr[tid])));
}

// --- Credit to "C printing bits": http://stackoverflow.com/questions/9280654/c-printing-bits
void printBits(unsigned int num){
    unsigned int size = sizeof(unsigned int);
    unsigned int maxPow = 1<<(size*8-1);
    int i=0;
    for(;i<size;++i){
        for(;i<size*8;++i){
            // print last bit and shift left.
            printf("%u ",num&maxPow ? 1 : 0);
            num = num<<1;
        }       
    }
}

void main(void)
{
    const int N = 64;

    thrust::device_vector<float> d_vec1(N,1.f);
    thrust::device_vector<float> d_vec2(N,1.f);

    d_vec2[3] = 3.f;
    d_vec2[7] = 4.f;

    unsigned int Num_Threads_per_Block      = 64;
    unsigned int Num_Blocks_per_Grid        = 1;
    unsigned int Num_Warps_per_Block        = Num_Threads_per_Block/32;
    unsigned int Num_Warps_per_Grid         = (Num_Threads_per_Block*Num_Blocks_per_Grid)/32;

    thrust::device_vector<unsigned int> d_result(Num_Warps_per_Grid,0);

    check_if_equal_elements<<<Num_Blocks_per_Grid,Num_Threads_per_Block>>>((float*)thrust::raw_pointer_cast(d_vec1.data()),
                                                                          (float*)thrust::raw_pointer_cast(d_vec2.data()),
                                                                          (unsigned int*)thrust::raw_pointer_cast(d_result.data()),
                                                                          Num_Warps_per_Block);

    unsigned int val = d_result[1];
    printBits(val);
    val = d_result[0];
    printBits(val);

    getchar();
}

Emulating std::bitset in CUDA

1 Answers1