2

I am given a string of the following form:

  • Each line contains two ints seperated by a single space. The line ending is a single "\n"
  • The number of lines is a multiple of 2
  • The ints are of a nice form: They are all positive, have no trailing zeros and no '+' or '-' and all have 1 to 7 digits

An example would be:

"5531 1278372\n461722 1278373\n1022606 1278374\n224406 1278375\n1218709 1278376\n195903 1278377\n604672 1278378\n998322 1278379\n"

I have a pointer to the beginning as well as to the ending of the string.

I want to parse this string by extracting all the integers from it as fast as possible. The first idea that comes to mind is using a loop in which we always extract the first integer of the string using sse and advance the pointer to the start of the next integer (which in this case is two characters after the end of the string, since all delimiters have size 1). As I have a pointer to the end of the string, the function that reads the first int of the string would not have to check for '\0' but only gets called when there really is another integer in the string. One could for example adapt the solution from How to implement atoi using SIMD? to obtain the following function, which returns the first integer of the string and then advances the pointer to after the delimiter after the int (so it points to the beginning of the next int):

inline uint32_t strToUintSSE(char*& sta) {
            //Set up constants
            __m128i zero        = _mm_setzero_si128();
            __m128i multiplier1 = _mm_set_epi16(1000,100,10,1,1000,100,10,1);
            __m128i multiplier2 = _mm_set_epi32(0, 100000000, 10000, 1);
            
            //Compute length of string
            __m128i string      = _mm_lddqu_si128((__m128i*)sta);
            
            __m128i digitRange  = _mm_setr_epi8('0','9',0,0,0,0,0,0,0,0,0,0,0,0,0,0);
            int len = _mm_cmpistri(digitRange, string, _SIDD_UBYTE_OPS | _SIDD_CMP_RANGES | _SIDD_NEGATIVE_POLARITY);

            
            sta += len + 1;
            
            //Reverse order of number
            __m128i permutationMask = _mm_set1_epi8(len);
            permutationMask = _mm_add_epi8(permutationMask, _mm_set_epi8(-16,-15,-14,-13,-12,-11,-10,-9,-8,-7,-6,-5,-4,-3,-2,-1));
            string = _mm_shuffle_epi8(string, permutationMask);

            //Shift string down
            __m128i zeroChar = _mm_set1_epi8('0');
            string = _mm_subs_epu8(string, zeroChar);

            //Multiply with right power of 10 and add up
            __m128i stringLo = _mm_unpacklo_epi8(string, zero);
            __m128i stringHi = _mm_unpackhi_epi8(string, zero);
            stringLo = _mm_madd_epi16(stringLo, multiplier1);
            stringHi = _mm_madd_epi16(stringHi, multiplier1);
            __m128i intermediate = _mm_hadd_epi32(stringLo, stringHi);
            intermediate = _mm_mullo_epi32(intermediate, multiplier2);

            //Hadd the rest up
            intermediate = _mm_add_epi32(intermediate, _mm_shuffle_epi32(intermediate, 0b11101110));
            intermediate = _mm_add_epi32(intermediate, _mm_shuffle_epi32(intermediate, 0b01010101));

            return _mm_cvtsi128_si32(intermediate);
        }

Also since we no that the string only contains '0'-'9', ' ' and '\n' we can calculate len using

int len = _mm_tzcnt_32(_mm_movemask_epi8(_mm_cmpgt_epi8(zeroChar, string)));

However, the requirements imply that a XMM register always fits two integers, so I would like to modify the function to extract both of them from "string". The idea is to transform "string" so that the first int starts at byte 0 and the second int starts at byte 8. Before, we reversed the digits, since at the moment we added zeros to the end of the number, making it bigger. However we want to make the zeros trainling zeros which is done by reversing. Another possibility would be to have the first int end at byte 7 (inclusive) and the second at byte 15, so we essentially aligned them with the right of their respective half of the register. This way the zeros are also in the higher digits of the number. To summarize: If we e.g. have the string "2035_71582\n" (i'm using '_' to visualize the ' ' better), we want the XMM register to look like

  • '5','3','0','2',0,0,0,0,'2','8','5','1','7',0,0,0

  • 0,0,0,0,'2','0','3','5',0,0,0,'7','1','5','8','2'

Note: These possibilities are the same but each half is reversed

(Of course multiplying by the right power of 10 and then adding the digits up can also be optimized since we now only have 7 digits instead of 16)

To perform this transformation, we must first extract the length of the two integers. This can be done with

inz mask = _mm_movemask_epi8(_mm_cmpgt_epi8(zeroChar, string)); //Instead of _mm_cmpistrmint len1 = _mm_tzcnt_32(mask);
int combinedLen = _mm_tzcnt_32(mask & (mask-1)); //Clears the lowest bit of mask first, will probably emit a BLSR

To implement the transform, I could think of multiple different ways:

  • Use a shuffle like before. One could try to compute the mask like this:

    __m128i permutationMask = _mm_setr_epi8(len1, len1, len1, len1, combinedLen, combinedLen, combinedLen, combinedLen);
    permutationMask = _mm_add_epi8(permutationMask, _mm_set_epi8(-8,-7,-6,-5,-4,-3,-2,-1,-8,-7,-6,-5,-4,-3,-2,-1));
    

    However, this runs into the problem that when reversing the second int, we run backwards into the first int: e.g. "2035_71582\n" -> '5','3','0','2',0,0,0,0,'2','8','5','1','7',' ','5','3' (we have an extra 53 from the first int at the end).

    If we right shift instead of reversing we can compute the mask analogously (only reverse the summand)

    __m128i permutationMask = _mm_setr_epi8(len1, len1, len1, len1, combinedLen, combinedLen, combinedLen, combinedLen);
    permutationMask = _mm_add_epi8(permutationMask, _mm_setr_epi8(-8,-7,-6,-5,-4,-3,-2,-1,-8,-7,-6,-5,-4,-3,-2,-1));
    

    but run into the same problem: "2035_71582\n" -> 0,0,0,0,'2','0','3','5', '3','5',' ','7','1','5','8','2'

    It seems to me, that computing a good shuffle mask is pretty hard to do. Maybe the best solution with this approach would be to first use a shuffle and then zero out the wrong bytes (there are many possibilities) for this)

  • Instead of a shuffle, use two pslldq to shift the ints to the right and then combine them (one is upper half, one is lower half) for example using a blend. However one would still need to zero out bytes as the first int would also possibly appear in the second half.

  • Use a gather. However we would still need to zero out the wrong bytes

  • Something different entirely, e.g. using AVX512-VBMI (vpexpandb or vpcompressb maybe?). Maybe one wouldn't even have to compute len1 and combinedLen but could use the mask directly?

The first 3 don't feel very optimal yet while I have no clue about the last. Can you think of a good way to do this? This can also be extended to using YMM registers to parse 4 ints at once (or even ZMM for 8 ints) which complicates things again, since the first 2 approaches become infeasible due to the inability to shuffle/shift across the 128bit lines, so the last approach looks the most promising to me. Sadly, I don't really have any experience with AVX512. You are free to use any version of SSE, AVX, AVX2 - and also AVX512 as a last resort (I can't run AVX512 but if you find a nice solution with it, I would be interested as well).

JulianV
  • 21
  • 3
  • You can load 8 bytes each for the permutation mask from an uint8-array `{0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0, 1, 2, 3, 4, 5, 6}`, starting at the string-length of each integer. For the upper part of the mask you also need to add the start index of the second integer. – chtz Nov 13 '22 at 16:35

1 Answers1

1

Here's a strategy from over here.

Other References:

Is there a fast way to convert a string of 8 ASCII decimal digits into a binary number?

How to find the position of the only-set-bit in a 64-bit value using bit manipulation efficiently?

See also:

http://0x80.pl/articles/simd-parsing-int-sequences.html

#include <tmmintrin.h> // SSSE3
#include <stdint.h>

static inline
uint64_t swar_parsedigits (uint8_t* src, uint32_t* res) {
    // assumes digit group len max is 7
    // assumes each group is separated by a single space or '\n'    
    uint64_t v;
    memcpy(&v, src, 8); // assumes little endian
    v -= 0x3030303030303030ULL; 
    uint64_t t = v & 0x8080808080808080ULL; // assumes "valid" input...
    uint64_t next = ((t & (-t)) * 0x20406080a0c0e1ULL) >> 60;
    v <<= (9 - next) * 8; // shift off trash chars
    v = ((v * 0x0000000000000A01ULL) >> 8) & 0x00FF00FF00FF00FFULL;
    v = ((v * 0x0000000000640001ULL) >> 16) & 0x0000FFFF0000FFFFULL;
    v = (v * 0x0000271000000001ULL) >> 32;
    *res = v;
    return next;
}


static inline
uint64_t ssse3_parsedigits (uint8_t* src, uint32_t* res) {
    // assumes digit group len max is 7
    // assumes each group is separated by a single space or '\n'
    const __m128i mul1 = _mm_set1_epi64x(0x010A0A6414C82800);
    const __m128i mul2 = _mm_set1_epi64x(0x0001000A01F461A8);
    const __m128i x30 = _mm_set1_epi8(0x30);
    __m128i v;

    // get delimiters
    v = _mm_loadu_si128((__m128i *)(void *)src);
    v = _mm_sub_epi8(v, x30);
    uint32_t m = _mm_movemask_epi8(v);

    // find first 2 group lengths
    int len0 = __builtin_ctzl(m);
    m &= m - 1; // clear the lowest set bit
    int next = __builtin_ctzl(m);
    int len1 = next - (len0 + 1);

    // gather groups
    uint64_t x0, x1;
    memcpy(&x0, src, 8);
    memcpy(&x1, &src[len0 + 1], 8);
    
    // pad out to 8 bytes
    x0 <<= (8 - len0) * 8;
    x1 <<= (8 - len1) * 8;

    // back into the xmm register...
    v = _mm_set_epi64x(x1, x0);
    v = _mm_subs_epu8(v, x30); 
    v = _mm_madd_epi16(_mm_maddubs_epi16(mul1, v), mul2);
    v = _mm_hadd_epi32(v, v);
    
    _mm_storel_epi64((__m128i*)(void *)res, v);
    return next + 1;
}
aqrit
  • 792
  • 4
  • 14
  • SSSE3 is not AVX2 or AVX-512 – phuclv Nov 16 '22 at 00:57
  • @phuclv - The question was tagged sse/avx and said "You are free to use any version of SSE, AVX, AVX2 - and also AVX512 as a last resort". SSSE3 _is_ a subset of AVX2. I also linked to a version using "AVX2" 256-bit ymmwords. – aqrit Dec 15 '22 at 17:21