2

I have 32 length-1-to-4 strings stored in AVX2 uint8x32 registers, one register for each of length, byte0, byte1, byte2, byte3. I'd like to concatenate all the strings and write them densely to memory. If all the strings were equal length this would be straightforward: I'd shuffle the bytes to their target positions using pshufb and use some blend calls to mix the byte0/byte1/byte2/byte3 registers together. (Alternatively perhaps I could use vpunpck* instructions. Not yet figured out...)

However, the variable-length aspect makes this harder: where each output byte comes from is now a nontrivial function of the lengths. I can't figure out how to implement this efficiently in AVX2 code. Help?

Bottom line: I'd like a reimplementation of the following function, written in (as fast as possible) vector code rather than scalar code:

(godbolt link)


int concat_strings(char* dst, __m256i len_v, __m256i byte0_v, __m256i byte1_v, __m256i byte2_v, __m256i byte3_v) {
    char len[32];
    char byte0[32];
    char byte1[32];
    char byte2[32];
    char byte3[32];
    _mm256_store_si256(reinterpret_cast<__m256i*>(len), len_v);
    _mm256_store_si256(reinterpret_cast<__m256i*>(byte0), byte0_v);
    _mm256_store_si256(reinterpret_cast<__m256i*>(byte1), byte1_v);
    _mm256_store_si256(reinterpret_cast<__m256i*>(byte2), byte2_v);
    _mm256_store_si256(reinterpret_cast<__m256i*>(byte3), byte3_v);

    int pos = 0;
    for (int i = 0; i < 32; ++i) {
        dst[pos + 0] = byte0[i];
        dst[pos + 1] = byte1[i];
        dst[pos + 2] = byte2[i];
        dst[pos + 3] = byte3[i];
        pos += len[i];
    }
    return pos;
}

Help?

rp123
  • 3,623
  • 1
  • 18
  • 20
  • Somewhat related: [AVX2 what is the most efficient way to pack left based on a mask?](https://stackoverflow.com/q/36932240) for fixed-width 32-bit elements based on a per-element predicate (look up a pshufb control vec vs. compute one with `pext` which you probably have along with AVX2). But your string data has byte granularity and is interleaved across vectors so any wide vector store will need some shuffling (yes `vpunpck*` + `vpermq` or similar). Maybe doing something conditional *during* that, based on `vpcmpgtb len, set1(1)`? – Peter Cordes May 10 '20 at 10:23
  • BTW, your string format seems fairly inconvenient to deal with. Unless that's really good for something else you're doing with them, could you make the length implicit? Like have bytes 1..3 be `0` for length=1 strings. Or no bytes `0` for a length 4 string. (IDK if that helps, and a large lookup table of shuffle-control vectors would suck because of cache misses.) – Peter Cordes May 10 '20 at 10:28
  • Perhaps explicit lengths do help. The final position of any given byte is the prefix-sum of the preceding lengths, plus its offset within its string. [parallel prefix (cumulative) sum with SSE](https://stackoverflow.com/q/19494114) shows some ideas for shuffling. For horizontal sums of bytes in general, `psadbw` against a vector of `setzero()` hsums in 8-byte groups. If you can turn that into `pshufb` masks and then blend in the right order, trailing bytes bytes of short strings might get overwritten for free? – Peter Cordes May 10 '20 at 16:09
  • Probably working in 128-bit chunks is a good bet, or independently in high halves and recombine later with a memcpy? – Peter Cordes May 10 '20 at 16:09
  • After `vpunpck`-ing 4 strings each into a 16 byte subvector, another candidate might be a sequence of `vps[l,r]lv[d,q]` with some cleverly calculated indices. This could compress 4 strings to a consecutive area inside a register-half, but I'm afraid there is no 128-bit shift with non-immediate shift-count. – chtz May 10 '20 at 17:29
  • Thanks @PeterCordes, I can see how to make `pext` work. No luck yet with the other techniques. – rp123 Jun 07 '20 at 19:18

0 Answers0