
I'd like to bit-shift a pointed-to block of memory of arbitrary size. I thought about working on the problem char by char, and I came up with this solution, which works.

#include <limits.h>   /* CHAR_BIT */
#include <stdbool.h>
#include <stddef.h>   /* size_t */

void rshift(void *self, int shift, size_t size) {
    char *s = self;
    bool take, pass;
    for (int i = 0; i < shift; i++) {
        take = false;
        for (int j = 0; j < size; j++, take = pass) {
            pass = *(s + j) & 1;
            *(s + j) >>= 1;
            *(s + j) ^= (-take ^ *(s + j)) & (1 << (CHAR_BIT - 1));
        }
    }
}

Where self is the memory block to shift, size is its size in chars and shift is how many bits to shift. Essentially, for each byte I shift it by one and pass the bit that was lost to the next byte.

However, this algorithm is quite terrible; I think it's O(shift * size) or so, and I'm quite sure the whole thing can be solved in O(size).

Note: I showed right shift; left shifting is much the same thing.

Giuppox
  • If `shift` is greater than the byte size, you can first determine how many full bytes to shift and do that first, so your bit shifting will always be less than the byte size. This way it will be `O(size)` – Eugene Sh. Dec 09 '21 at 15:26
  • Also related: https://stackoverflow.com/q/21334021/10871073 – Adrian Mole Dec 09 '21 at 15:28
  • The optimal way to do any task like this is hugely dependent on architecture and processor model and often on circumstances (cache state, other code executing at or near the same time, and so on). – Eric Postpischil Dec 09 '21 at 15:32
  • `unsigned char` should be used for working with bits, not `char`, due to complications with signs. – Eric Postpischil Dec 09 '21 at 15:32
  • `*(s + j) ^= (-take ^ *(s + j)) & (1 << (CHAR_BIT - 1));` is almost entirely unnecessary. `*(s + j)` is easier read and written as `s[j]`. This code merely flips the high bit of `s[j]` if `take` is true (assuming two’s complement; it may fail in sign-and-magnitude). But if `unsigned char` is used then the preceding `*(s + j) >>= 1;` has cleared the high bit, so we can simply use `s[j] |= take << CHAR_BIT-1;`. – Eric Postpischil Dec 09 '21 at 15:35
  • It is inefficient to shift the bits one-by-one using a loop. Given some non-zero `shift` less than the width of a byte, `s[j] >> shift` gives the new low bits and `s[j] << CHAR_BIT-shift` gives the bits to pass as the next high bits. – Eric Postpischil Dec 09 '21 at 15:37
  • @EricPostpischil yeah I also tried approaching the problem from that point of view. However I wasn't able to implement the case where `shift` is greater than the bits of a `char` – Giuppox Dec 09 '21 at 16:01
  • @EugeneSh. I'm not sure I understand, can you show an implementation? – Giuppox Dec 09 '21 at 16:03

1 Answer


First of all, you do not need the outer shift loop. You can split the shift into a bit part, int n = shift % CHAR_BIT;, and a byte part, int m = shift / CHAR_BIT;. Indeed, shifting by k * CHAR_BIT bits is equivalent to moving whole bytes directly by k positions. Moreover, you can directly extract the n least significant bits of *(s + j) using int pass = *(s + j) & ((1 << n) - 1);. Then, the shift of n bits can be done directly with a simple right shift (*(s + j) >>= n;). Finally, the bits passed from the previous byte can be merged into the shifted value using *(s + j) |= pass << (CHAR_BIT - n);. Note that these operations should work on unsigned char values, which are safer than signed char here, since a right shift of a negative signed type is implementation-defined. This makes the algorithm run in O(size) time.

One optimisation is to do nothing at all when shift is 0, and to use memmove when n is 0 (a whole-byte shift).
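Putting the two paragraphs above together, a minimal sketch could look like the following (the name `rshift_fast` and the exact zero-fill choices are my own; the answer does not spell out a full implementation):

```c
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* size_t */
#include <string.h>   /* memmove, memset */

/* O(size) right shift of a byte buffer: split shift into a whole-byte
   part m (handled with memmove) and a residual bit part n (handled in
   a single pass, carrying the lost low bits into the next byte). */
void rshift_fast(void *self, size_t shift, size_t size)
{
    unsigned char *s = self;          /* unsigned char: well-defined shifts */
    size_t m = shift / CHAR_BIT;      /* whole-byte part */
    unsigned n = shift % CHAR_BIT;    /* residual bit part */

    if (m >= size) {                  /* everything is shifted out */
        memset(s, 0, size);
        return;
    }
    if (m > 0) {                      /* byte-granular move, zero-fill the head */
        memmove(s + m, s, size - m);
        memset(s, 0, m);
    }
    if (n == 0)
        return;                       /* pure byte shift: already done */

    unsigned char carry = 0;
    for (size_t j = 0; j < size; j++) {
        unsigned char pass = s[j] & ((1u << n) - 1u); /* bits this byte loses */
        s[j] = (unsigned char)((s[j] >> n) | (carry << (CHAR_BIT - n)));
        carry = pass;                 /* they become the next byte's high bits */
    }
}
```

With the same semantics as the question's code, shifting `{0x01, 0x00}` by 1 moves the low bit of byte 0 into the high bit of byte 1.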

Another optimization is to work on several char items at a time using a bigger type like uint64_t. However, this assumes CHAR_BIT is 8, which is the case on almost all modern processors. You should be very careful with this optimization, since the loaded/stored values need to be correctly aligned and the type punning must be done safely (using either memcpy or a union) so as not to break the strict aliasing rule (which would result in undefined behaviour).
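The safe type-punning pattern mentioned above can be sketched as follows (the helper names are hypothetical; note that the byte order within the loaded word is the machine's endianness, which a word-wise shift would still have to account for):

```c
#include <stdint.h>
#include <string.h>

/* Load/store a 64-bit word from/to a byte buffer via memcpy instead of a
   pointer cast: this respects alignment and the strict aliasing rule, and
   compilers typically turn it into a single load/store instruction.
   Assumes CHAR_BIT == 8 so that uint64_t covers exactly 8 buffer bytes. */
static uint64_t load_u64(const unsigned char *p)
{
    uint64_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

static void store_u64(unsigned char *p, uint64_t w)
{
    memcpy(p, &w, sizeof w);
}
```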

An alternative optimization is to use SIMD intrinsics (like SSE/AVX2 on modern x86-64 processors, or NEON on most ARM processors). This optimization is more efficient than working on bigger types and much safer. However, using intrinsics requires advanced programming skills and makes the code less portable and often barely readable. SSE/NEON can work simultaneously on 16 8-bit items per cycle, and AVX2 on up to 32 8-bit items, resulting in drastically faster code. Note that compilers can sometimes do this automatically for you (assuming optimization flags are enabled).

Note that writing s[j] instead of *(s + j) is shorter and easier to read. Also note that the expression -take ^ *(s + j) likely results in implementation-defined behaviour, since -1 can be represented differently on different architectures (see two's complement and one's complement), although almost all modern processors use two's complement.

Jérôme Richard
  • for intrinsics, is it better to work on fewer elements of a bigger size or on many items of a small size? – Giuppox Dec 10 '21 at 18:57
  • SIMD vectors generally have a fixed size (in bytes), independent of the size of the items being computed. For example, AVX2 (256 bits wide) can compute 32 x 8-bit integers simultaneously, or 8 x 32-bit integers, or even 4 x 64-bit integers. Thus, it is better to work directly on 8-bit elements, as using bigger types is much more dangerous (it only matters when you cannot/do-not-want-to use SIMD intrinsics). Note that there are some rare architectures with variable-sized SIMD vectors, but they are subdivided into fixed-size vectors internally by the processor. – Jérôme Richard Dec 10 '21 at 19:05
  • Gotcha. What if my `size` is smaller than 32? Is it still worth using intrinsics? – Giuppox Dec 10 '21 at 19:13
  • It is often not worth it, due to the relatively high latency (compared to scalar operations). Recent instruction sets like AVX-512/SVE provide new features which help a bit. – Jérôme Richard Dec 10 '21 at 19:42
  • Can you show an implementation of shifting using intrinsics? I have managed to make the whole thing work with the algorithm you wrote, but I'm facing problems trying to implement it with SIMD. As [here](http://wwwuser.gwdg.de/~parallel/intel_compiler_doc_91/main_cls/mergedProjects/intref_cls/common/intref_sse2_int_shift.htm#srli_si128), I haven't found any suitable instruction. Maybe `_mm_srli_si128`, but apparently it shifts bytes, not bits. [Github](https://github.com/Giuppox/block/blob/02557aedba2dca88ad29514cfebeeeed5cf7e770/block/core/src/generic.c#L94-L119) – Giuppox Dec 18 '21 at 10:22
  • A fully working fast SIMD implementation is a bit tricky and tedious to write (due to the many corner cases and no direct instruction for this). `_mm_srli_si128` does shift bits, but in buckets of 8 contiguous bits. This is likely not what you want. At least, not directly, but it can help. You can use `_mm_alignr_pi8`, which is similar but I think better for your case. I am surprised there is apparently no SSE instruction to shift 8-bit integers. You have to work with bigger ones, using `_mm_srli_epi32` for example. Consequently, you unfortunately need to care about endianness... – Jérôme Richard Dec 18 '21 at 13:29
  • Indeed, x86 processors are little-endian, so you need to reverse the order of the octets first using a `_mm_shuffle_epi8`, then perform your operation, and then do the same shuffle again. As for the shift between 32-bit buckets, you can use a shuffle and a blend (`_mm_blendv_epi8`). That last part is the trickiest, and also the one that will impact performance the most. Because of all the intermediate operations, SSE should only be a bit faster than using 64-bit integers. AVX should help to improve performance significantly, but it introduces lanes that are a mess for developers. – Jérôme Richard Dec 18 '21 at 13:39
  • AVX2 introduces permute instructions that are great for shuffling items across lanes (most instructions cannot cross lanes). However, Intel (intentionally?) neglected to provide 8-bit and 16-bit permute instructions. As a result, 8-bit shuffles are cumbersome to perform, and you will certainly get a headache if you try it for the first time ;) . The solution is to mix 8-bit shuffles, 32-bit permutes and blend instructions. – Jérôme Richard Dec 18 '21 at 13:45
  • Thanks a lot, that's quite a lot of information to metabolize, let's see if I can figure something out ;) – Giuppox Dec 19 '21 at 10:34