Bitwise shift of buffer in CUDA

Question

Is there any way to memmove a buffer in CUDA in a bitwise manner? E.g., for a buffer with two bytes and a pointer

buf -> 00000000 11111111

I would like to shift bit portions left or right given their bit offset. Something like

void memmove(void* buf, int from, int bits, int delta)

For the buffer above I would expect then:

00000111 11111111

after calling

memmove(buf,8,3,-3)
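
For reference, the intended semantics can be pinned down with a slow, bit-by-bit version in plain C. This is only a sketch under assumptions the question does not spell out: the name bit_memmove and the helpers get_bit/set_bit are hypothetical, offsets count bits MSB-first from the start of the buffer, a negative delta moves toward lower offsets, the source bits are left in place, and the caller guarantees that both ranges lie inside the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read the bit at MSB-first offset pos (bit 0 is the top bit of byte 0). */
static int get_bit(const uint8_t *buf, size_t pos) {
    return (buf[pos / 8] >> (7 - pos % 8)) & 1;
}

/* Write the bit at MSB-first offset pos. */
static void set_bit(uint8_t *buf, size_t pos, int value) {
    uint8_t mask = (uint8_t)(1u << (7 - pos % 8));
    if (value)
        buf[pos / 8] |= mask;
    else
        buf[pos / 8] &= (uint8_t)~mask;
}

/* Hypothetical reference: copy `bits` bits starting at bit offset `from`
 * to bit offset `from + delta`. No bounds checking against the buffer size. */
void bit_memmove(void *buf, int from, int bits, int delta) {
    uint8_t *b = (uint8_t *)buf;
    assert(from >= 0 && bits >= 0 && from + delta >= 0);
    if (delta < 0) {
        /* Moving toward lower offsets: copy front to back so an
         * overlapping source is read before it is overwritten. */
        for (int i = 0; i < bits; ++i)
            set_bit(b, (size_t)(from + delta + i), get_bit(b, (size_t)(from + i)));
    } else {
        /* Moving toward higher offsets: copy back to front. */
        for (int i = bits - 1; i >= 0; --i)
            set_bit(b, (size_t)(from + delta + i), get_bit(b, (size_t)(from + i)));
    }
}
```

With the two-byte buffer above, {0x00, 0xFF}, calling bit_memmove(buf, 8, 3, -3) leaves {0x07, 0xFF}, i.e. 00000111 11111111.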

I could not find a proper function for this in CUDA's documentation or Google.

Please note: I am not interested in cudaMemcpy or the normal memmove. A bitwise memmove is what I need!
I am also aware of how to do this on a normal CPU. I need a more efficient alternative, which is why I am considering CUDA.

Bit shift operators work in CUDA. Can you build this from bit shifts and bitwise AND/OR? I.e. buf = (buf & 0xFF) << 3 | (buf & 0x7) — Levi Barnes, Sep 26 '14 at 11:40
This won't work for a buffer of arbitrary size. I guess an initial approach could be to read the bits to write in each byte (one thread handles one byte), synchronize the threads, and then write the bits in their new locations — user46317, Sep 26 '14 at 11:48
Don't read single bits. 4 bytes/thread is probably better. A warp must read 128 bytes together so reading anything less than 4 bytes (128 bytes / 32 threads) slows you down — Levi Barnes, Sep 26 '14 at 12:04
It's not clear why you need 4 parameters. Isn't a pointer to the buffer and a signed number of bit positions to shift enough? In the arbitrary case (arbitrary buffer size, arbitrary number of bits to shift and direction) this doesn't look trivial to implement on the CPU either. It would probably be instructive to provide a reference CPU implementation that does what you want. — Robert Crovella, Sep 26 '14 at 12:58
The "bits" parameter tells how many bits to shift. The "from" parameter tells the offset of the left-most bit to shift. So I do need 4 parameters. The CPU version is, as you say, quite complex (that's why I did not post it), but I was expecting it to be simpler when using CUDA (or already implemented in some library...) — user46317, Sep 26 '14 at 13:12
The only constraint is that the buffer should not be "too small", since in that case I will need lots of different buffers (and the indexing will take too much time). It is not a problem to have a reasonably large buffer, say 4 to 256 KB, if I can process it fast with CUDA. — user46317, Sep 28 '14 at 21:26
I would load two consecutive ints per thread, shift these by s%32 bits, then write to a location shifted by s/32 ints. You might have trouble trying to write in place and you'll get lousy performance if this is the only operation you're doing on the GPU. Do you have more work to do on the GPU? — Levi Barnes, Sep 28 '14 at 22:19
Depends on the application. The shifting is more relevant for update-intensive applications; query-intensive applications do these shifting operations less often. In update-intensive scenarios it is acceptable to do the shifting on the host (memmove based). In query-intensive scenarios it could also be done this way, since few memory transfers are required to get the updated buffer back to device memory. Unfortunately, the most interesting scenario is a mixed one, where doing so can be expected to perform poorly (compared to a CPU-only approach) — user46317, Sep 29 '14 at 10:38
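
The word-wise approach suggested in the comments (two consecutive ints per thread, a shift by s%32 bits, and a displacement of s/32 ints) can be sketched as follows. It is written as plain C so it can be checked on the host, and it rests on several assumptions: the names shifted_word and shift_left_words are hypothetical, the buffer is an array of 32-bit words in which word 0 holds the leading 32 bits MSB-first (a byte buffer on a little-endian machine would need its bytes swapped on load), the shift covers the whole buffer toward lower bit offsets, and the result goes to a separate output array to sidestep the in-place hazard mentioned above. It uses the gather form, where each output word reads its two source words.

```c
#include <stddef.h>
#include <stdint.h>

/* Word i of the result for a left shift of the whole buffer by s bits.
 * Words past the end of src are treated as zero. */
static uint32_t shifted_word(const uint32_t *src, size_t n, size_t i, size_t s) {
    size_t q = s / 32;               /* whole-word part of the shift */
    unsigned r = (unsigned)(s % 32); /* remaining bit shift, 0..31 */
    size_t idx = i + q;
    uint32_t hi = (idx < n) ? src[idx] : 0;
    uint32_t lo = (idx + 1 < n) ? src[idx + 1] : 0;
    if (r == 0)
        return hi;                   /* avoid the undefined 32-bit shift */
    return (hi << r) | (lo >> (32 - r));
}

/* Out-of-place shift: dst and src must not overlap. Every iteration is
 * independent of the others, so in a kernel the loop index could become
 * the global thread index (blockIdx.x * blockDim.x + threadIdx.x). */
void shift_left_words(uint32_t *dst, const uint32_t *src, size_t n, size_t s) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = shifted_word(src, n, i, s);
}
```

Because each output word depends on only two neighbouring input words, there is no need for synchronization between threads as long as input and output are distinct buffers. Restricting the shift to a run of bits, as the four-parameter interface asks for, would additionally require masking the first and last words of the destination range.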
