
Varint64 is a variable-length data format.

Varint64: a uint64 is divided into 8-bit groups; each byte holds a flag in its highest bit and 7 data bits in the lower bits. When the flag bit is 1, the next byte still belongs to the same uint64; otherwise this byte is the highest (last) byte of the value. See https://developers.google.com/protocol-buffers/docs/encoding
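For reference, a minimal scalar decoder for this encoding might look like the following sketch (decode_varint64 is just an illustrative name, and it doesn't guard against malformed input):

#include <cstdint>
#include <cstddef>

// Decode one varint64 starting at p; writes the number of bytes consumed
// to *len and returns the decoded value (flag bits removed).
inline uint64_t decode_varint64(const uint8_t* p, size_t* len) {
    uint64_t value = 0;
    size_t i = 0;
    while (true) {
        uint8_t byte = p[i];
        value |= (uint64_t)(byte & 0x7F) << (7 * i);  // low 7 bits are data
        ++i;
        if ((byte & 0x80) == 0) break;                // flag bit 0: last byte
    }
    *len = i;
    return value;
}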

The data stream looks like:
uint8_t p[32] = {128, 129,130,131,132, 133, 0, 128, 129,130,131,132, 134, 0,131,132, 133, 134, 0, 128, 129,130,131,132, 133, 134, 0,128,129,130,131};

The target __m256i, viewed as four 64-bit elements m64i[0..3], should hold:
m64i[0] = p[0]:p[6]
m64i[1] = p[7]:p[13]
m64i[2] = p[14]:p[18]
m64i[3] = p[19]:p[26]
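
In scalar code, this raw unpack (keeping the flag bits, just splitting at each byte whose high bit is 0 and zero-padding each value to 8 bytes) might look like the following sketch; unpack4_raw_scalar is just an illustrative name, and it assumes each value fits in at most 8 bytes, as in the example above:

#include <cstdint>
#include <cstddef>
#include <cstring>

// Copy the raw bytes of the next four varint64 values (flag bits kept)
// into out[0..3], zero-padding each to 8 bytes.
// Returns the total number of input bytes consumed.
inline size_t unpack4_raw_scalar(const uint8_t* p, uint64_t out[4]) {
    size_t pos = 0;
    for (int i = 0; i < 4; ++i) {
        size_t len = 1;
        while (p[pos + len - 1] & 0x80) ++len;  // advance until the flag bit is 0
        uint64_t v = 0;
        memcpy(&v, p + pos, len);               // little-endian raw bytes, upper bytes stay 0
        out[i] = v;
        pos += len;
    }
    return pos;
}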

I want to use AVX2/SSE to parse varint64. First, I need to unpack ("align") the raw stream into the elements of an __m256i before doing the next step, and I find that this unpacking is very time-consuming. Is there any good way to quickly get the data for the first four varint64 elements from the byte stream into the 64-bit elements of an __m256i?

My fastest "align" (unpack) code so far:

#include <immintrin.h>
#include <cstdint>
#include <cstring>

// Mask that keeps the low x bytes of a uint64_t (x >= 8 keeps all 8).
#define SetBitMask(x) ((x) >= 8 ? 0xFFFFFFFFFFFFFFFFULL : ((1ULL << ((x)<<3)) - 1))

inline __m256i _mm256_align_epi64_2(const uint8_t* p) {
    auto b = _mm256_loadu_si256((const __m256i*)p);
    // One mask bit per byte, set where the continuation flag (bit 7) is clear,
    // i.e. where a varint ends.
    uint32_t bm_not = ~(uint32_t)_mm256_movemask_epi8(b);
    auto first_len = __builtin_ctz(bm_not) + 1;
    bm_not >>= first_len;
    auto second_len = __builtin_ctz(bm_not) + 1;
    bm_not >>= second_len;
    auto third_len = __builtin_ctz(bm_not) + 1;
    bm_not >>= third_len;
    auto fourth_len = __builtin_ctz(bm_not) + 1;

    // memcpy instead of *(uint64_t*) to avoid strict-aliasing / alignment UB.
    uint64_t n1, n2, n3, n4;
    memcpy(&n1, p, 8);
    memcpy(&n2, p += first_len, 8);
    memcpy(&n3, p += second_len, 8);
    memcpy(&n4, p += third_len, 8);
    n1 &= SetBitMask(first_len);
    n2 &= SetBitMask(second_len);
    n3 &= SetBitMask(third_len);
    n4 &= SetBitMask(fourth_len);
    // _mm256_set_epi64x takes elements from high to low, so pass n4..n1
    // to put the first varint into 64-bit element 0.
    return _mm256_set_epi64x(n4, n3, n2, n1);
}
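
Building on that, a quick test on the example stream (just a sketch to check the lane layout):

#include <cstdio>

int main() {
    uint8_t p[32] = {128, 129, 130, 131, 132, 133, 0,
                     128, 129, 130, 131, 132, 134, 0,
                     131, 132, 133, 134, 0,
                     128, 129, 130, 131, 132, 133, 134, 0,
                     128, 129, 130, 131};
    __m256i v = _mm256_align_epi64_2(p);

    alignas(32) uint64_t lanes[4];
    _mm256_store_si256((__m256i*)lanes, v);
    // Expect lane 0 = raw bytes p[0..6], lane 1 = p[7..13],
    // lane 2 = p[14..18], lane 3 = p[19..26] (flag bits still present).
    for (int i = 0; i < 4; ++i)
        printf("m64i[%d] = 0x%016llx\n", i, (unsigned long long)lanes[i]);
    return 0;
}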
cat /proc/cpuinfo
model name  : AMD EPYC 7642 48-Core Processor
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca

  • @Mgetz Historical reasons, plus the cost of joint debugging with external departments that a format conversion would incur. – Songmeng Jun 23 '21 at 14:53
  • @Mgetz Thanks for the reminder; a version implemented with SSE would also be acceptable. – Songmeng Jun 23 '21 at 14:58
  • AVX2 implies SSE is available, it doesn't need an extra tag. The tags constrain the answers. So the current tags will actually provide a wider array of possible answers. – Mgetz Jun 23 '21 at 14:59
  • you do however need the C++ tag... both for syntax highlighting and because your code is C++. – Mgetz Jun 23 '21 at 15:02
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/234119/discussion-between-songmeng-and-mgetz). – Songmeng Jun 23 '21 at 15:06
  • The first minor optimization would be: instead of shifting after each CTZ, just clear the lowest set bit, so you have the actual position of each element separately instead of needing a running total sum to recover them. AVX2 usually implies BMI2, so `bm_not = _blsr_u32(bm_not);` or just let the compiler do that peephole optimization for you with `bm_not &= bm_not - 1;` – Peter Cordes Jun 23 '21 at 15:22 (see the sketch after these comments)
  • I wonder if BMI2 `pdep` could be useful for the mask generation, since you don't have AVX-512 `vpexpandd` (err I guess AVX512-VBMI2 for `vpexpandb`). But `pdep` is only fast on Intel, very slow on AMD. Is your use-case perhaps limited to Intel servers, or do you need something that's also good on AMD? – Peter Cordes Jun 23 '21 at 15:26
  • When you say "align", you mean "decode", right? You're not talking about being able to use `_mm256_load` instead of `loadu`, but instead about getting four varints into four elements of an `__m256i`. Obviously in a loop, you'd want to just use unaligned loads, not copy to an aligned buffer (how? AVX2 unaligned loads would be the fastest way to do that, and storing them just so you could do an aligned reload wouldn't be helpful). – Peter Cordes Jun 23 '21 at 15:30
  • @PeterCordes Sorry, I didn't clarify my environment. I've added an environment description to the question: AMD, and no AVX-512. – Songmeng Jun 23 '21 at 15:31
  • Ok, so `pdep` isn't worth looking at, except for future readers with Intel CPUs. (Or AMD Zen3, which makes pext/pdep fast). – Peter Cordes Jun 23 '21 at 15:34
  • @PeterCordes I think the first step of decoding is to unpack the tightly packed varint64 bytes into the 8-byte elements of an __m256i ("align" them to 8 bytes), and then start the subsequent decoding work, such as removing the high-order flag bits and so on. – Songmeng Jun 23 '21 at 15:40
  • That's not what the word "align" normally means when talking about SIMD. The first comment you got (now deleted) was assuming you were talking about memory alignment, like C++ `alignas`. I think "align" is more confusing than helpful; perhaps "unpack" could work, but "decompress" also works (although you're really only asking about the unprocessed data, without removing the signalling 1 bits). I added "qwords elements of" to the title to make it more explicit what you're doing. – Peter Cordes Jun 24 '21 at 04:54
  • BTW, `(*(uint64_t*)...` is strict-aliasing and alignment undefined behaviour. You could `_mm_loadu_si64(p)` or `_mm_loadl_epi64( (__m128i*)p )` for `movq` / `movhps` (or `pinsrq`) loads, and manually shuffle together the 128-bit halves. Or with GNU C, typedef an `__attribute__((aligned(1), may_alias))` version of `uint64_t` that you can safely point anywhere and dereference. Or use `memcpy`. See [Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?](https://stackoverflow.com/q/47510783) – Peter Cordes Jun 24 '21 at 05:06
  • partial answer: first 2 bit-positions -> 8x8 LUT of `__m128i` pshufb masks (which can do zeroing as well). 1024 bytes. Do the same thing for the second two bit positions, into the high half of an `__m256i` via `vinserti128` loads of the data and the shuffle-control to feed `_mm256_shuffle_epi8`. Will post an answer if/when I get around to it, unless someone else wants to take that idea and run with it first. – Peter Cordes Jul 04 '21 at 02:46
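
Following the suggestions in the comments above (using _blsr_u32 / `bm_not &= bm_not - 1` so each ctz gives an absolute position, and using movq-style intrinsic loads instead of dereferencing a uint64_t*), the extraction could be restructured roughly like the sketch below. unpack4_blsr is just an illustrative name; I haven't benchmarked this variant, and like the original code it may read past the end of the last varint.

#include <immintrin.h>
#include <cstdint>

// Requires AVX2 + BMI1 (e.g. compile with -march=native).
inline __m256i unpack4_blsr(const uint8_t* p) {
    __m256i b = _mm256_loadu_si256((const __m256i*)p);
    // Bit i is set where byte i terminates a varint (its flag bit is 0).
    uint32_t ends = ~(uint32_t)_mm256_movemask_epi8(b);

    // Clear the lowest set bit after each ctz, so every ctz yields the
    // absolute byte position of a terminator; no running sum is needed.
    uint32_t e0 = __builtin_ctz(ends); ends = _blsr_u32(ends); // or ends &= ends - 1
    uint32_t e1 = __builtin_ctz(ends); ends = _blsr_u32(ends);
    uint32_t e2 = __builtin_ctz(ends); ends = _blsr_u32(ends);
    uint32_t e3 = __builtin_ctz(ends);

    // movq loads of 8 raw bytes starting at each varint.
    __m128i v0 = _mm_loadl_epi64((const __m128i*)p);
    __m128i v1 = _mm_loadl_epi64((const __m128i*)(p + e0 + 1));
    __m128i v2 = _mm_loadl_epi64((const __m128i*)(p + e1 + 1));
    __m128i v3 = _mm_loadl_epi64((const __m128i*)(p + e2 + 1));
    __m256i v = _mm256_set_m128i(_mm_unpacklo_epi64(v2, v3),
                                 _mm_unpacklo_epi64(v0, v1));

    // Zero the bytes past each varint's end, reusing SetBitMask from above.
    __m256i keep = _mm256_set_epi64x(SetBitMask(e3 - e2), SetBitMask(e2 - e1),
                                     SetBitMask(e1 - e0), SetBitMask(e0 + 1));
    return _mm256_and_si256(v, keep);
}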

0 Answers