
Unsigned integers can be compressed with "bit-packing" techniques: within a block of unsigned integers, only the significant bits are stored, which compresses the data when all integers in the block are "small". The method is known as FOR (frame of reference).
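
For illustration, a toy scalar sketch of the packing step (the function name and layout here are just for this example; a real FOR scheme would also subtract a per-block reference value):

#include <stdint.h>

// Toy bit-packing: every value of the block is stored with 'bits' bits,
// where 'bits' is just enough for the largest value in the block.
// Assumes each in[i] fits in 'bits' bits.
static void pack_block(const uint32_t *in, int n, unsigned bits, uint8_t *out) {
    uint64_t acc = 0;     // bit accumulator
    unsigned filled = 0;  // number of valid bits in acc
    for (int i = 0; i < n; i++) {
        acc |= (uint64_t)in[i] << filled;
        filled += bits;
        while (filled >= 8) { *out++ = (uint8_t)acc; acc >>= 8; filled -= 8; }
    }
    if (filled) *out = (uint8_t)acc;  // flush the last partial byte
}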

There are SIMD libraries that do this very efficiently.

Now I want to use FOR-like techniques to encode signed integers, e.g. from a differenced sequence of unsorted unsigned integers. The sign of each signed integer needs to be stored somewhere; there are two options:

  1. Store the signs in a separate block of data. This adds overhead.
  2. Store the sign together with the absolute value of each signed integer.

I'm following path 2 right now. Two's complement has the sign bit in the msb (most significant bit), so that won't work for bit-packing à la FOR. One possibility is to store the sign in the lsb (least significant bit) instead. Storing signed integers this way is very unusual, and as far as I know there are no instructions that support it. The question now is: can these lsb-signed integers be encoded/decoded efficiently using SIMD instructions?
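
To make that concrete, here is a minimal scalar sketch of the lsb-sign mapping I have in mind (the names are just for illustration):

#include <stdint.h>

// Sign in the least significant bit, absolute value in the bits above it.
// Assumes x > INT32_MIN (that magnitude would need one bit too many).
static inline uint32_t lsb_sign_encode(int32_t x) {
    uint32_t mag = (uint32_t)(x < 0 ? -x : x);
    return (mag << 1) | (uint32_t)(x < 0);
}

static inline int32_t lsb_sign_decode(uint32_t v) {
    int32_t mag = (int32_t)(v >> 1);
    return (v & 1) ? -mag : mag;
}

This keeps small magnitudes small (e.g. -1 becomes 3, 1 becomes 2), so the leading bits stay zero and bit-packing still works.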

I think AVX-512 `_mm_testn_epi32_mask` can be used to extract the lsb from each uint32, followed by a shift, then two `mask_extract`s of some sort? Quite convoluted.

mcmayer
  • A common and reversible mapping from signed to unsigned integers is: `(x < 0) ? (-2 * x - 1) : (2 * x)`. Could you make use of that? – njuffa Apr 14 '21 at 08:56
  • AVX-512 has SIMD rotate of 32-bit elements, if that helps. (https://www.felixcloutier.com/x86/vprold:vprolvd:vprolq:vprolvq). So you could bring the sign-bit to the bottom. But you'd then I guess want to mask away all the other leading ones for a negative number. Perhaps you can bit-flip or negate for negative inputs, so end up with leading zeros, and can use the low bit as a signal to negate or not when decoding. (`0-x` after a rotate has the advantage of leaving the low bit set for negative numbers, unlike `x ^ -1`.) I think this is the same mapping @njuffa suggested. – Peter Cordes Apr 14 '21 at 08:56
  • Yeah, so encode with `vprold zmm, zmm, 1` / `vpabsd zmm, zmm` (absolute value of 32-bit elements). Decode with ... I'll post an answer when I think of something, if nobody beats me to it :P – Peter Cordes Apr 14 '21 at 09:05
  • @njuffa The positive and negative integers become interleaved, basically. Good idea. – mcmayer Apr 14 '21 at 09:15
  • A related question was asked 10 years ago: https://stackoverflow.com/questions/3557599/whats-the-faster-way-to-implement-the-signed-to-unsigned-map-rx-2x-1-if-x0 – mcmayer Apr 14 '21 at 09:21
  • I guess `vpabsd` after rotate would decide based on the *new* sign bit, not the old. So it's only safe if your numbers fit in 31-bit 2's complement signed integers. If you can't make that assumption, encoding won't be quite *that* cheap, e.g. `pabsd` *first* and then shift / OR in the original sign bit. (4 instructions, and only requires SSE2 not AVX-512. Or still 2 with AVX-512 VBMI2 for `vpshld` double-register shift). Unlike pure C, SIMD intrinsics give you easy access to highly efficient absolute value, and (with AVX512) to rotates. – Peter Cordes Apr 14 '21 at 09:48
  • To implement @njuffa's suggestion, note that `-x - 1` is the same as `~x`, so `(x < 0) ? (-2 * x - 1) : (2 * x)` is the same as `(x>>31) ^ (x+x)`. – chtz Apr 14 '21 at 10:01
  • @chtz: Oh right, where `>>` is an *arithmetic* right shift (`vpsrad` or `vpsraw`). So yeah, 3 instructions. – Peter Cordes Apr 14 '21 at 10:04

1 Answer


Untested ZigZag examples in C using SSE2 for 64-bit integers:

(note: SSE2 is missing some 64-bit element instructions, e.g. an arithmetic right shift or a signed compare for qwords, hence the workarounds below...)

#include <emmintrin.h>  // SSE2
#include <tmmintrin.h>  // SSSE3, for _mm_alignr_epi8 used in delta_encode_epi32


// from a comment by Peter Cordes
__m128i zigzag_encode_epi64(__m128i v) {
    __m128i signmask = _mm_shuffle_epi32(v, _MM_SHUFFLE(3,3,1,1)); // copy the high dword of each qword into both halves
    signmask = _mm_srai_epi32(signmask, 31);                       // 0 or -1 per qword (emulates the missing psraq)
    return _mm_xor_si128(_mm_add_epi64(v, v), signmask);           // (v + v) ^ (v >> 63)
}


__m128i zigzag_decode_epi64 (__m128i v) {
    __m128i signmask = _mm_and_si128(_mm_set_epi32(0, 1, 0, 1), v); // isolate the low bit of each qword
    signmask = _mm_sub_epi64(_mm_setzero_si128(), signmask);        // 0 -> 0, 1 -> all-ones
    return _mm_xor_si128(_mm_srli_epi64(v, 1), signmask);           // (v >> 1) ^ -(v & 1)
}


// no constant
__m128i zigzag_decodev3_epi64 (__m128i v) {
    __m128i t = _mm_srli_epi64(v, 1);                          // v >> 1
    __m128i signmask = _mm_sub_epi64(_mm_slli_epi64(t, 1), v); // (v & ~1) - v == -(v & 1)
    return _mm_xor_si128(t, signmask);                         // (v >> 1) ^ -(v & 1)
}
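
A scalar reference of the same mapping (untested sketch) is handy for checking the vector versions element by element:

#include <stdint.h>

// Scalar ZigZag, same mapping as the vector code above.
static inline uint64_t zigzag_encode_scalar64(int64_t x) {
    return ((uint64_t)x << 1) ^ (uint64_t)(x >> 63);  // x >> 63 is an arithmetic shift here
}

static inline int64_t zigzag_decode_scalar64(uint64_t x) {
    return (int64_t)(x >> 1) ^ -(int64_t)(x & 1);
}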

ZigZag is good for bitwise varints. However, a bytewise group-varint scheme may instead want to "sign extend from a variable bit-width".
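
For that case, the usual scalar trick for sign-extending from a variable bit-width b is something like this (sketch):

#include <stdint.h>

// Sign-extend the low 'b' bits of x (1 <= b <= 32) to a full int32_t.
static inline int32_t sign_extend_var(uint32_t x, unsigned b) {
    uint32_t m = 1u << (b - 1);                        // position of the field's sign bit
    return (int32_t)(((x & ((m << 1) - 1)) ^ m) - m);  // flip, then subtract to propagate the sign
}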


32-bit examples

I favored compares over arithmetic shifts. I assume that, when unrolled, the compares will have 1 cycle lower latency.

__m128i zigzag_encode_epi32 (__m128i v) {
    __m128i signmask = _mm_cmpgt_epi32(_mm_setzero_si128(), v); // all-ones where v < 0
    return _mm_xor_si128(_mm_add_epi32(v, v), signmask);        // (v + v) ^ (v < 0 ? -1 : 0)
}
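
For comparison, a version of the same encode using the arithmetic-shift mapping `(x>>31) ^ (x+x)` from the comments:

__m128i zigzag_encode_epi32_sra (__m128i v) {
    __m128i signmask = _mm_srai_epi32(v, 31);             // 0 or -1 per element
    return _mm_xor_si128(_mm_add_epi32(v, v), signmask);  // (x >> 31) ^ (x + x)
}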


__m128i zigzag_decode_epi32 (__m128i v) {
    const __m128i m = _mm_set1_epi32(1);
    __m128i signmask = _mm_cmpeq_epi32(_mm_and_si128(m, v), m); // all-ones where the low bit is set
    return _mm_xor_si128(_mm_srli_epi32(v, 1), signmask);       // (v >> 1) ^ -(v & 1)
}


// requires SSSE3 for _mm_alignr_epi8
__m128i delta_encode_epi32 (__m128i v, __m128i prev) {
    return _mm_sub_epi32(v, _mm_alignr_epi8(v, prev, 12)); // v[i] - v[i-1], seeded by the last element of prev
}


// prefix sum (see many of answers around stackoverflow...)
__m128i delta_decode_epi32 (__m128i v, __m128i prev) {
    prev = _mm_shuffle_epi32(prev, _MM_SHUFFLE(3,3,3,3)); // [P P P P]
    v = _mm_add_epi32(v, _mm_slli_si128(v, 4)); // [A AB BC CD]
    prev = _mm_add_epi32(prev, v); // [PA PAB PBC PCD]
    v = _mm_slli_si128(v, 8); // [0 0 A AB]
    return _mm_add_epi32(prev, v); // [PA PAB PABC PABCD]
}


__m128i delta_zigzag_encode_epi32 (__m128i v, __m128i prev) {
    return zigzag_encode_epi32(delta_encode_epi32(v, prev));
}


__m128i delta_zigzag_decode_epi32 (__m128i v, __m128i prev) {
    return delta_decode_epi32(zigzag_decode_epi32(v), prev);
}
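
Typical usage over a whole array, carrying `prev` across iterations (sketch; assumes `n` is a multiple of 4 and 16-byte-aligned buffers):

#include <stddef.h>
#include <stdint.h>

void delta_zigzag_encode_block (const int32_t *in, uint32_t *out, size_t n) {
    __m128i prev = _mm_setzero_si128();  // first delta is taken against 0
    for (size_t i = 0; i < n; i += 4) {
        __m128i v = _mm_load_si128((const __m128i *)(in + i));
        _mm_store_si128((__m128i *)(out + i), delta_zigzag_encode_epi32(v, prev));
        prev = v;  // the raw values seed the next block's deltas
    }
}

void delta_zigzag_decode_block (const uint32_t *in, int32_t *out, size_t n) {
    __m128i prev = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 4) {
        __m128i v = _mm_load_si128((const __m128i *)(in + i));
        prev = delta_zigzag_decode_epi32(v, prev);  // running decoded values
        _mm_store_si128((__m128i *)(out + i), prev);
    }
}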

Note: Delta coding would be faster (for decoding, and for the round trip) if the elements were transposed while encoding and then transposed back during decoding, because horizontal prefix sums are really slow. However, determining the optimum number of elements to transpose in each batch seems like a hard problem.
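
As a sketch of what the transposed layout could look like: if the encoder interleaves four streams so that lane k of every vector belongs to stream k, decoding collapses to one vertical add per vector, with no shuffles:

// Vertical delta decode for a transposed layout (sketch). 'running' holds the
// last decoded value of each of the four interleaved streams.
__m128i delta_decode_transposed_epi32 (__m128i v, __m128i *running) {
    *running = _mm_add_epi32(*running, v);  // one add, no prefix-sum shuffle chain
    return *running;
}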

aqrit
  • You might avoid needing a zeroed register (and a move) for encode if you emulate the non-existent `psraq` as `pshufd` + `psrad`. https://godbolt.org/z/xjd9Kozjh has that and an SSE4 decode, and an alternate SSE2 decode that doesn't need a constant. (More shifts-port pressure, but fewer total front-end uops when you count the movdqa that's replaced by a copy-and-shuffle.) – Peter Cordes Apr 14 '21 at 22:23
  • Note that clang optimized your original encode into better asm than this, when AVX or SSE4 was available. So perhaps worth writing a manual SSE4.2 `pcmpgtq` version that all compilers can use when it's available, even ones that aren't as good at optimizing SIMD intrinsics as clang is. Note that OP even has AVX-512 available so you even have arithmetic qword right shift, or test-into-mask for a merge-masked sub-from-0. – Peter Cordes Apr 15 '21 at 01:41
  • *I assume - when unrolled - compares will have 1 cycle lower latency.* - Why? Immediate-count shifts like `psrad` have the same latency (usually 1c) as `pand` or `pcmpgtd` on all microarchitectures Agner Fog has tested except AMD K10. Were you maybe looking at latency for `psrad xmm, xmm`, which is a cycle slower on some (like Bulldozer)? – Peter Cordes Apr 15 '21 at 03:00
  • Or do you mean because it requires movdqa + psrad, instead of an xor-zeroing off the critical path? That is a factor on CPUs too old for XMM mov-elimination (which first appeared in IvyBridge and Bulldozer). But despite your choice of intrinsics, GCC and clang both optimize that compare into a `movdqa` + `psrad xmm, 31`! https://godbolt.org/z/W5orrojWb. Clang even turns the `== 1` compare into `v <<31 >>31`. GCC stupidly duplicates the same constant twice. With luck the linker will merge it :/ – Peter Cordes Apr 15 '21 at 03:05