
I need to compute

(a & b).count()

over a large set (> 10000) of bit vectors (std::bitset<N>), where N ranges anywhere from 2^10 to 2^16.

const size_t N = 2048;
std::vector<std::vector<char>> distances(100000, std::vector<char>(100000));
std::vector<std::bitset<N>> bits(100000);
load_from_file(bits);
for(int i = 0; i < bits.size(); i++){
    for(int j = 0; j < bits.size(); j++){
        distances[i][j] = (bits[i] & bits[j]).count();
    }
}

Currently I'm relying on chunked multithreading and SSE/AVX to compute the distances. Luckily I can use vpand from AVX to compute the &, but my code still uses popcnt (%rax) in a loop to compute the bit counts.
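
For reference, the scalar inner loop is essentially the following sketch (not the actual vectorized code; it just treats each bitset as an array of 64-bit words):

#include <nmmintrin.h> // _mm_popcnt_u64 (requires the POPCNT instruction)
#include <cstdint>
#include <cstddef>

// Sketch of the word-wise AND + popcount inner loop.
inline int and_popcount(const uint64_t* a, const uint64_t* b, size_t nwords){
    int total = 0;
    for(size_t i = 0; i < nwords; ++i)
        total += static_cast<int>(_mm_popcnt_u64(a[i] & b[i]));
    return total;
}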

Is there a way I can compute the (a & b).count() function on my GPU (nVidia 760m)? Ideally I would just pass 2 chunks of memory of N bits. I was looking at using thrust but I couldn't find a popcnt function.

EDIT:

Current CPU implementation.

// Validates the whole data set: classifies each example with leave-one-out K-NN
// on a thread pool and returns the fraction classified correctly.
double validate_pooled(const size_t K) const{
    int right = 0;
    const size_t num_examples = labels.size();
    threadpool tp;
    std::vector<std::future<bool>> futs;
    for(size_t i = 0; i < num_examples; i++){
        futs.push_back(tp.enqueue(&kNN<N>::validate_N, this, i, K));
    }
    for(auto& fut : futs)
        if(fut.get()) right++;

    return right / (double) num_examples;
}

// Leave-one-out check for example `cmp`: finds the `n` nearest neighbours by
// bitwise-AND popcount similarity and returns whether their majority label
// matches the true label.
bool validate_N(const size_t cmp, const size_t n) const{
    const size_t num_examples = labels.size();
    std::vector<char> dists(num_examples, -1);
    for(size_t i = 0; i < num_examples; i++){
        if(i == cmp) continue;
        dists[i] = (bits[cmp] & bits[i]).count();
    }
    typedef std::unordered_map<std::string,size_t> counter;
    counter counts;
    for(size_t i = 0; i < n; i++){
        auto iter = std::max_element(dists.cbegin(), dists.cend());
        size_t idx = std::distance(dists.cbegin(), iter);
        dists[idx] = -1; // Remove the top result.
        counts[labels[idx]] += 1;
    }
    auto iter = std::max_element(counts.cbegin(), counts.cend(),
            [](const counter::value_type& a, const counter::value_type& b){ return a.second < b.second; });

    return labels[cmp] == iter->first;
}

EDIT:

This is what I've come up with. However, it's brutally slow. I'm not sure if I'm doing something wrong.

#include <bitset>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>

template<size_t N>
struct popl
{
    typedef unsigned long word_type;
    std::bitset<N> _cmp;

    popl(const std::bitset<N>& cmp) : _cmp(cmp) {}

    __device__
    int operator()(const std::bitset<N>& x) const
    {
        int pop_total = 0;
        #pragma unroll
        // _M_w is libstdc++'s internal word array of std::bitset
        // (implementation-specific, not portable).
        for(size_t i = 0; i < N/64; i++)
            pop_total += __popcll(x._M_w[i] & _cmp._M_w[i]);

        return pop_total;
    }
}; 

int main(void) {
    const size_t N = 2048;

    thrust::host_vector<std::bitset<N> > h_vec;
    load_bits(h_vec);

    thrust::device_vector<std::bitset<N> > d_vec = h_vec;
    thrust::device_vector<int> r_vec(h_vec.size(), 0);
    for(int i = 0; i < h_vec.size(); i++){
        r_vec[i] = thrust::transform_reduce(d_vec.cbegin(), d_vec.cend(),  popl<N>(d_vec[i]), 0, thrust::maximum<int>());
    }

    return 0;
}
en4bz
  • It doesn't seem like `char` would be enough to hold the population count of `std::bitset` where `N` is 1024 up to 65536 ? Do you make some assumptions about the result of the bitwise-and? Even if it did, I guess you're producing (in `distances`) about 10GB of data (for `bits(100000)`)? – Robert Crovella Oct 05 '14 at 02:40
  • @RobertCrovella The bitsets are quite sparse. I probably won't be using 2 ^ 16, but I would like to try 2 ^ 14 in my tests. 2 ^ 12 is probably the most likely use case. Also I don't need to store `distances`, that was just as example/simplification. I just need to find the `K` largest distances for each of `bits[i]`, so most of the data is discarded. I'm calculating nearest neighbours using this metric. – en4bz Oct 05 '14 at 05:37

2 Answers


CUDA has population count intrinsics for both 32-bit and 64-bit types (__popc() and __popcll()).

These could be used directly in a CUDA kernel, or via thrust in a functor, perhaps passed to thrust::transform_reduce.
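
For illustration, here is a minimal sketch of such a functor (hypothetical names; it assumes the bitsets have already been flattened into one contiguous device array of 64-bit words, rather than passing std::bitset objects to the device):

// Sketch only: popcount of (query & candidate) for one candidate vector.
// `words` holds num_vecs * words_per_vec contiguous 64-bit words on the device;
// `query` points at the words of the query vector.
struct and_popcount
{
    const unsigned long long *query;
    const unsigned long long *words;
    int words_per_vec;

    __device__ int operator()(int vec_idx) const {
        const unsigned long long *v = words + (size_t)vec_idx * words_per_vec;
        int total = 0;
        for (int i = 0; i < words_per_vec; ++i)
            total += __popcll(query[i] & v[i]);
        return total;
    }
};

Applied over vector indices (for example with thrust::transform and a counting iterator), this produces one count per candidate vector.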

If that is the only function you want to do on the GPU, it may be difficult to get a net "win" because of the "cost" of transferring data to/from the GPU. Your overall input data set appears to be about 1GB in size (100000 vectors of bit length 65536), but the output data set appears to be 10-40GB in size based on my calculations (100000 * 100000 * 1-4 bytes per result).

Either the CUDA kernel or the thrust function and data layout should be crafted carefully with the objective of having the code run limited only by memory bandwidth. The cost of data transfer could also be mitigated, perhaps to a large extent, by overlap of copy and compute operations, mainly on the output data set.

At first glance, this problem appears to be somewhat similar to the problem of computing euclidean distances among sets of vectors, so this question/answer may be of interest, from a CUDA perspective.

EDIT: adding some code that I used to investigate this. I am able to get a significant speedup (~25x including data copy time) over a naive single-threaded CPU implementation, but I don't know how fast the CPU version would be using "chunked multithreading and SSE/AVX", so it would be interesting to see more of your implementation or get some performance numbers. I also don't think the CUDA code I have here is highly optimized; it's just a "first cut".

In this case, for proof-of-concept, I focused on a small problem size, N=2048, 10000 bitsets. For this small problem size, I can fit enough of the vector of bitsets in shared memory, for a "small" threadblock size, to take advantage of shared memory. So this particular approach would have to be modified for larger N.

$ cat t581.cu
#include <iostream>
#include <vector>
#include <bitset>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

#define nTPB 128
#define OUT_CHUNK 250
#define N_bits 2048
#define N_vecs 10000
const size_t N = N_bits;

// Each thread owns one bit vector (column `idx` of the word-transposed `in`
// array) and first stages its N/32 words in shared memory. It then loops over
// the comparison vectors in [start_idx, end_idx), ANDing word-by-word with the
// corresponding words read from global memory and accumulating the popcount.
__global__ void comp_dist(unsigned *in, unsigned *out, unsigned numvecs, unsigned start_idx, unsigned end_idx){
  __shared__ unsigned sdata[(N/32)*nTPB];
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < numvecs)
    for (int i = 0; i < (N/32); i++)
      sdata[(i*nTPB)+threadIdx.x] = in[(i*numvecs)+idx];
  __syncthreads();
  int vidx = start_idx;
  if (idx < numvecs)
    while (vidx < end_idx) {
      unsigned sum = 0;
      for (int i = 0; i < N/32; i++)
        sum += __popc(sdata[(i*nTPB)+ threadIdx.x] & in[(i*numvecs)+vidx]);
      out[((vidx-start_idx)*numvecs)+idx] = sum;
      vidx++;}
}

void cpu_test(std::vector<std::bitset<N> > &in, std::vector<std::vector<unsigned> > &out){

  for (int i=0; i < in.size(); i++)
    for (int j=0; j< in.size(); j++)
      out[i][j] = (in[i] & in[j]).count();
}

int check_data(unsigned *d1, unsigned start_idx, std::vector<std::vector<unsigned> > &d2){
  for (int i = start_idx; i < start_idx+OUT_CHUNK; i++)
    for (int j = 0; j<N_vecs; j++)
      if (d1[((i-start_idx)*N_vecs)+j] != d2[i][j]) {std::cout << "mismatch at " << i << "," << j << " was: " << d1[((i-start_idx)*N_vecs)+j] << " should be: " << d2[i][j] << std::endl;  return 1;}
  return 0;
}

unsigned long long get_time_usec(){
  timeval tv;
  gettimeofday(&tv, 0);
  return (unsigned long long)(((unsigned long long)tv.tv_sec*1000000ULL)+(unsigned long long)tv.tv_usec);
}

int main(){

  unsigned long long t1, t2;
  std::vector<std::vector<unsigned> > distances;
  std::vector<std::bitset<N> > bits;

  for (int i = 0; i < N_vecs; i++){
    std::vector<unsigned> dist_row(N_vecs, 0);
    distances.push_back(dist_row);
    std::bitset<N> data;
    for (int j =0; j < N; j++) data[j] = rand() & 1;
    bits.push_back(data);}
  t1 = get_time_usec();
  cpu_test(bits, distances);
  t1 = get_time_usec() - t1;
  unsigned *h_data = new unsigned[(N/32)*N_vecs];
  memset(h_data, 0, (N/32)*N_vecs*sizeof(unsigned));
  for (int i = 0; i < N_vecs; i++)
    for (int j = 0; j < N; j++)
        if (bits[i][j]) h_data[(i)+((j/32)*N_vecs)] |= 1U<<(31-(j&31));

  unsigned *d_in, *d_out1, *d_out2, *h_out1, *h_out2;
  cudaMalloc(&d_in, (N/32)*N_vecs*sizeof(unsigned));
  cudaMalloc(&d_out1, N_vecs*OUT_CHUNK*sizeof(unsigned));
  cudaMalloc(&d_out2, N_vecs*OUT_CHUNK*sizeof(unsigned));
  cudaStream_t stream1, stream2;
  cudaStreamCreate(&stream1);
  cudaStreamCreate(&stream2);
  h_out1 = new unsigned[N_vecs*OUT_CHUNK];
  h_out2 = new unsigned[N_vecs*OUT_CHUNK];
  t2 = get_time_usec();
  cudaMemcpy(d_in, h_data, (N/32)*N_vecs*sizeof(unsigned), cudaMemcpyHostToDevice);
  for (int i = 0; i < N_vecs; i += 2*OUT_CHUNK){
    comp_dist<<<(N_vecs + nTPB - 1)/nTPB, nTPB, 0, stream1>>>(d_in, d_out1, N_vecs, i, i+OUT_CHUNK);
    cudaStreamSynchronize(stream2);
    if (i > 0) if (check_data(h_out2, i-OUT_CHUNK, distances)) return 1;
    comp_dist<<<(N_vecs + nTPB - 1)/nTPB, nTPB, 0, stream2>>>(d_in, d_out2, N_vecs, i+OUT_CHUNK, i+2*OUT_CHUNK);
    cudaMemcpyAsync(h_out1, d_out1, N_vecs*OUT_CHUNK*sizeof(unsigned), cudaMemcpyDeviceToHost, stream1);
    cudaMemcpyAsync(h_out2, d_out2, N_vecs*OUT_CHUNK*sizeof(unsigned), cudaMemcpyDeviceToHost, stream2);
    cudaStreamSynchronize(stream1);
    if (check_data(h_out1, i, distances)) return 1;
    }
  cudaDeviceSynchronize();
  t2 = get_time_usec() - t2;
  std::cout << "cpu time: " << ((float)t1)/(float)1000 << "ms gpu time: " << ((float) t2)/(float)1000 << "ms" << std::endl;
  return 0;
}
$ nvcc -O3 -arch=sm_20 -o t581 t581.cu
$ ./t581
cpu time: 20324.1ms gpu time: 753.76ms
$

CUDA 6.5, Fedora20, Xeon X5560, Quadro5000 (cc2.0) GPU. The above test case includes results verification between the distances data produced on the CPU vs. the GPU. I've also broken this into a chunked algorithm with results data transfer (and verification) overlapped with compute operations, to make it more easily extendable to the case where there is a very large amount of output data (e.g. 100000 bitsets). I haven't actually run this through the profiler yet, however.

EDIT 2: Here's a "windows version" of the code:

#include <iostream>
#include <vector>
#include <bitset>
#include <stdlib.h>
#include <time.h>


#define nTPB 128
#define OUT_CHUNK 250
#define N_bits 2048
#define N_vecs 10000
const size_t N = N_bits;

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)



__global__ void comp_dist(unsigned *in, unsigned *out, unsigned numvecs, unsigned start_idx, unsigned end_idx){
  __shared__ unsigned sdata[(N/32)*nTPB];
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < numvecs)
    for (int i = 0; i < (N/32); i++)
      sdata[(i*nTPB)+threadIdx.x] = in[(i*numvecs)+idx];
  __syncthreads();
  int vidx = start_idx;
  if (idx < numvecs)
    while (vidx < end_idx) {
      unsigned sum = 0;
      for (int i = 0; i < N/32; i++)
        sum += __popc(sdata[(i*nTPB)+ threadIdx.x] & in[(i*numvecs)+vidx]);
      out[((vidx-start_idx)*numvecs)+idx] = sum;
      vidx++;}
}

void cpu_test(std::vector<std::bitset<N> > &in, std::vector<std::vector<unsigned> > &out){

  for (unsigned i=0; i < in.size(); i++)
    for (unsigned j=0; j< in.size(); j++)
      out[i][j] = (in[i] & in[j]).count();
}

int check_data(unsigned *d1, unsigned start_idx, std::vector<std::vector<unsigned> > &d2){
  for (unsigned i = start_idx; i < start_idx+OUT_CHUNK; i++)
    for (unsigned j = 0; j<N_vecs; j++)
      if (d1[((i-start_idx)*N_vecs)+j] != d2[i][j]) {std::cout << "mismatch at " << i << "," << j << " was: " << d1[((i-start_idx)*N_vecs)+j] << " should be: " << d2[i][j] << std::endl;  return 1;}
  return 0;
}

unsigned long long get_time_usec(){

  return (unsigned long long)((clock()/(float)CLOCKS_PER_SEC)*(1000000ULL));
}

int main(){

  unsigned long long t1, t2;
  std::vector<std::vector<unsigned> > distances;
  std::vector<std::bitset<N> > bits;

  for (int i = 0; i < N_vecs; i++){
    std::vector<unsigned> dist_row(N_vecs, 0);
    distances.push_back(dist_row);
    std::bitset<N> data;
    for (int j =0; j < N; j++) data[j] = rand() & 1;
    bits.push_back(data);}
  t1 = get_time_usec();
  cpu_test(bits, distances);
  t1 = get_time_usec() - t1;
  unsigned *h_data = new unsigned[(N/32)*N_vecs];
  memset(h_data, 0, (N/32)*N_vecs*sizeof(unsigned));
  for (int i = 0; i < N_vecs; i++)
    for (int j = 0; j < N; j++)
        if (bits[i][j]) h_data[(i)+((j/32)*N_vecs)] |= 1U<<(31-(j&31));

  unsigned *d_in, *d_out1, *d_out2, *h_out1, *h_out2;
  cudaMalloc(&d_in, (N/32)*N_vecs*sizeof(unsigned));
  cudaMalloc(&d_out1, N_vecs*OUT_CHUNK*sizeof(unsigned));
  cudaMalloc(&d_out2, N_vecs*OUT_CHUNK*sizeof(unsigned));
  cudaCheckErrors("cudaMalloc fail");
  cudaStream_t stream1, stream2;
  cudaStreamCreate(&stream1);
  cudaStreamCreate(&stream2);
   cudaCheckErrors("cudaStrem fail");
  h_out1 = new unsigned[N_vecs*OUT_CHUNK];
  h_out2 = new unsigned[N_vecs*OUT_CHUNK];
  t2 = get_time_usec();
  cudaMemcpy(d_in, h_data, (N/32)*N_vecs*sizeof(unsigned), cudaMemcpyHostToDevice);
   cudaCheckErrors("cudaMemcpy fail");
  for (int i = 0; i < N_vecs; i += 2*OUT_CHUNK){
    comp_dist<<<(N_vecs + nTPB - 1)/nTPB, nTPB, 0, stream1>>>(d_in, d_out1, N_vecs, i, i+OUT_CHUNK);
    cudaCheckErrors("cuda kernel loop 1 fail");
    cudaStreamSynchronize(stream2);
    if (i > 0) if (check_data(h_out2, i-OUT_CHUNK, distances)) return 1;
    comp_dist<<<(N_vecs + nTPB - 1)/nTPB, nTPB, 0, stream2>>>(d_in, d_out2, N_vecs, i+OUT_CHUNK, i+2*OUT_CHUNK);
    cudaCheckErrors("cuda kernel loop 2 fail");
    cudaMemcpyAsync(h_out1, d_out1, N_vecs*OUT_CHUNK*sizeof(unsigned), cudaMemcpyDeviceToHost, stream1);
    cudaMemcpyAsync(h_out2, d_out2, N_vecs*OUT_CHUNK*sizeof(unsigned), cudaMemcpyDeviceToHost, stream2);
    cudaCheckErrors("cuda kernel loop 3 fail");
    cudaStreamSynchronize(stream1);
    if (check_data(h_out1, i, distances)) return 1;
    }
  cudaDeviceSynchronize();
  cudaCheckErrors("cuda kernel loop 4 fail");
  t2 = get_time_usec() - t2;
  std::cout << "cpu time: " << ((float)t1)/(float)1000 << "ms gpu time: " << ((float) t2)/(float)1000 << "ms" << std::endl;
  return 0;
}

I've added CUDA error checking to this code. Be sure to build a release project in Visual Studio, not debug. When I run this on a Windows 7 laptop with a Quadro1000M GPU I get about 35 seconds for the CPU execution and about 1.5 seconds for the GPU.

Robert Crovella
  • I'm trying to find the `K` nearest neighbours using the given metric. This means I don't have to store `distances` only the `K` largest (where `K < 50`). Also the bitsets are quite sparse. I'd like to load `bits` onto the GPU then transfer back a `std::vector` to calculate the `K` largest elements (and their indices) for each `bits[i]`. I have 3GB of VRAM and 16GB of RAM. Do you think transferring the 100000 `std::vector`'s back to the CPU for processing will eliminate any speed-up I would get on the GPU? – en4bz Oct 05 '14 at 05:49
  • I would move as much of the algorithm as makes sense onto the GPU. For example, finding the `K` largest elements. That would reduce the cost of data movement. However, as a start you could do just what you have shown here. In that case I would focus on overlap of copy and compute -- it will be pretty much necessary anyway due to the output data set size -- and see what the comparison is like. Do you have a current CPU case worked out with specifics and a performance measurement? – Robert Crovella Oct 05 '14 at 13:49
  • Please see my edit. I tried my own implementation but it's brutally slow. For some reason I spend ~1/4 of the time in system calls (~7 minutes), 24 minutes in total. Via CPU it's about 3 minutes. I'll try yours in the meantime. Thanks for the help! – en4bz Oct 07 '14 at 00:17
  • On my code, when I increase the `N_vecs` to 100000, the CPU time is ~40 minutes and the GPU time is ~45 seconds, so about a 50x speedup. It would be nice to know what your CPU implementation is like, at least the size of data that corresponds to the 3 minute measurement. Is that for `N` = 2048? Is it for 100000 bitsets? – Robert Crovella Oct 07 '14 at 00:50
  • @RobertCrovella I've added the core parts of the CPU implementation. I've switched to a threadpool to maximize my throughput. I think 3 minutes was with `bitset<4096>` and K ~ 15. The number of bitsets is 100000. I tried your code but its check function failed. I think this has to do with the sizes you are assuming. `std::bitset` uses an unsigned long as its internal representation. I see you are using `unsigned` and dividing by 32 a lot. – en4bz Oct 07 '14 at 01:08
  • I think the reason for the failure doesn't have anything to do with the internal representation of bitset. None of my access methods assume anything about internal representation; I only access a bit at a time. I left CUDA error checking out of my code for brevity. Run my code with `cuda-memcheck` and see if any errors are reported on your machine. Are you running on a Windows laptop? And did you run my code as-is or change anything? – Robert Crovella Oct 07 '14 at 01:19

OpenCL 1.2 has popcount, which would seem to do what you want. It can work on a vector type, up to ulong16, which is 1024 bits at a time. Note that NVIDIA drivers only support OpenCL 1.1, which does not include this function.

Of course you could just use a function or table to compute it pretty quickly, so an OpenCL 1.1 implementation is possible as well, and would likely run at the memory bandwidth of the device.
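
For example, a branch-free popcount along these lines could stand in for the missing built-in (a sketch; it is plain integer arithmetic, so the same body works in OpenCL C with ulong):

// Sketch: classic SWAR 64-bit popcount for use where no popcount
// intrinsic or built-in is available.
inline unsigned popcount64(unsigned long long x){
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}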

Dithermaster
  • For the NVIDIA case, it's also quite simple to use the `popc` PTX instruction with a small bit of inline assembly. See: https://github.com/kylelutz/compute/blob/master/include/boost/compute/functional/detail/nvidia_popcount.hpp – Kyle Lutz Oct 06 '14 at 03:49
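
For reference, a rough sketch of that inline-PTX idea, written here as a CUDA device function (the linked Boost.Compute header does the equivalent inside an OpenCL C kernel string; CUDA itself already exposes this as __popc(), so this is only to show the pattern):

// Sketch only: 32-bit population count via the popc PTX instruction.
__device__ unsigned int ptx_popc(unsigned int x){
    unsigned int count;
    asm("popc.b32 %0, %1;" : "=r"(count) : "r"(x));
    return count;
}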