
I've implemented a method for parsing an unsigned integer string of length <= 8 using SIMD intrinsics available in .NET as follows:

public unsafe static uint ParseUint(string text)
{
  fixed (char* c = text)
  {
    var parsed = Sse3.LoadDquVector128((byte*) c);
    var shift = (8 - text.Length) * 2;
    var shifted = Sse2.ShiftLeftLogical128BitLane(parsed, 
      (byte) (shift));

    Vector128<byte> digit0 = Vector128.Create((byte) '0');
    var reduced = Sse2.SubtractSaturate(shifted, digit0);

    var shortMult = Vector128.Create(10, 1, 10, 1, 10, 1, 10, 1);
    var collapsed2 = Sse2.MultiplyAddAdjacent(reduced.As<byte, short>(), shortMult);

    var repack = Sse41.PackUnsignedSaturate(collapsed2, collapsed2);
    var intMult = Vector128.Create((short)0, 0, 0, 0, 100, 1, 100, 1);
    var collapsed3 = Sse2.MultiplyAddAdjacent(repack.As<ushort,short>(), intMult);

    var e1 = collapsed3.GetElement(2);
    var e2 = collapsed3.GetElement(3);
    return (uint) (e1 * 10000 + e2);
  }
}

Sadly, a comparison with a baseline uint.Parse() gives the following, rather unimpressive, result:

|    Method |      Mean |     Error |    StdDev |
|---------- |----------:|----------:|----------:|
|  Baseline | 15.157 ns | 0.0325 ns | 0.0304 ns |
| ParseSimd |  3.269 ns | 0.0115 ns | 0.0102 ns |

What are some of the ways the above code can be improved? My particular areas of concern are:

  • The way a bit shift of the SIMD register happens with a calculation involving text.Length
  • ~~The unpacking of UTF-16 data using a MultiplyAddAdjacent involving a vector of 0s and 1~~
  • The way elements are extracted using GetElement() -- maybe there's some ToScalar() call that can happen somewhere?
Dmitri Nesteruk
  • Are you calling almost 5x improvement “rather unimpressive result”? :-) – Soonts Feb 25 '21 at 16:08
  • @500-InternalServerError sorry, fixed – Dmitri Nesteruk Feb 25 '21 at 16:30
  • is the baseline `int.Parse()`? – JAlex Feb 25 '21 at 16:41
  • Move the constant values (like `intMult`) outside the procedure, e.g. as `static readonly Vector128` fields. – JAlex Feb 25 '21 at 16:43
  • @JAlex why would that make any difference whatsoever? – Dmitri Nesteruk Feb 25 '21 at 17:05
  • @JAlex the baseline is uint.Parse() – Dmitri Nesteruk Feb 25 '21 at 17:08
  • You should be able to use SSSE3 `pmaddubsw` to multiply bytes and add horizontally into words (shorts). https://www.felixcloutier.com/x86/pmaddubsw. Except I think your data is starting as UTF-16, so nevermind. Related (for UTF-8 / ASCII): [How to implement atoi using SIMD?](https://stackoverflow.com/q/35127060) detects the end of the number and shuffles accordingly, so it's probably pretty slow compared to scalar for numbers only a couple digits long. – Peter Cordes Feb 26 '21 at 06:29
  • @PeterCordes ugh, `MultiplyAddAdjacent` for bytes is, in fact, `pmaddubsw`, except that I don't need it here because I can treat UTF-16 as `short`s – Dmitri Nesteruk Feb 26 '21 at 06:36
  • `LoadDquVector128` aka `LDDQU` is a weird relic of Pentium 4, and only had a niche use there. – harold Mar 03 '21 at 15:57
  • For clarification: The string is stored as big-endian decimal without any invalid characters (i.e., you have up to 8 `uint16` with values between `'0'` and `'9'`)? And you can read beyond the length of `text` without issues? (Could you read before the start as well? Are there any guarantees on the data at these memory locations?) Packing from `uint16` to `uint8` is probably better done by `packuswb` (maybe that is what `.As()` does, I don't know C#) – chtz Mar 05 '21 at 09:13
  • Just curious. Why did you accept no answer? – aepot Jan 27 '22 at 15:01

3 Answers


I've made some optimizations.

public unsafe uint ParseUint2(string text)
{
    fixed (char* c = text)
    {
        Vector128<ushort> raw = Sse3.LoadDquVector128((ushort*)c);
        raw = Sse2.ShiftLeftLogical128BitLane(raw, (byte)((8 - text.Length) << 1));
        Vector128<ushort> digit0 = Vector128.Create('0');
        raw = Sse2.SubtractSaturate(raw, digit0);
        Vector128<short> mul0 = Vector128.Create(10, 1, 10, 1, 10, 1, 10, 1);
        Vector128<int> res = Sse2.MultiplyAddAdjacent(raw.AsInt16(), mul0);
        Vector128<int> mul1 = Vector128.Create(1000000, 10000, 100, 1);
        res = Sse41.MultiplyLow(res, mul1);
        res = Ssse3.HorizontalAdd(res, res);
        res = Ssse3.HorizontalAdd(res, res);
        return (uint)res.GetElement(0);
    }
}

I reduced the number of type conversions, and the final summation is done with vphaddd. As a result it's ~10% faster.

But... imm8 must be a compile-time constant. That means you can't pass a variable where an imm8 argument is expected; otherwise the JIT compiler won't emit the intrinsic instruction for the operation and will emit an external method call at that place instead (maybe there's a workaround). Thanks @PeterCordes for the help.
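
For illustration, a minimal sketch of the difference (not from the original answer; `raw` and `text` are as in the method above):

// Constant count: the JIT can encode it as imm8 and emits pslldq directly.
var ok = Sse2.ShiftLeftLogical128BitLane(raw, 4);

// Run-time count: it can't be encoded as imm8, so instead of the intrinsic
// instruction the JIT emits a call to a software fallback.
byte count = (byte)((8 - text.Length) << 1);
var slow = Sse2.ShiftLeftLogical128BitLane(raw, count);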

This monster is faster than the one above, though not significantly, regardless of text.Length.

public unsafe uint ParseUint3(string text)
{
    fixed (char* c = text)
    {
        Vector128<ushort> raw = Sse3.LoadDquVector128((ushort*)c);
        switch (text.Length)
        {
            case 0: raw = Vector128<ushort>.Zero; break;
            case 1: raw = Sse2.ShiftLeftLogical128BitLane(raw, 14); break;
            case 2: raw = Sse2.ShiftLeftLogical128BitLane(raw, 12); break;
            case 3: raw = Sse2.ShiftLeftLogical128BitLane(raw, 10); break;
            case 4: raw = Sse2.ShiftLeftLogical128BitLane(raw, 8); break;
            case 5: raw = Sse2.ShiftLeftLogical128BitLane(raw, 6); break;
            case 6: raw = Sse2.ShiftLeftLogical128BitLane(raw, 4); break;
            case 7: raw = Sse2.ShiftLeftLogical128BitLane(raw, 2); break;
        };
        Vector128<ushort> digit0 = Vector128.Create('0');
        raw = Sse2.SubtractSaturate(raw, digit0);
        Vector128<short> mul0 = Vector128.Create(10, 1, 10, 1, 10, 1, 10, 1);
        Vector128<int> res = Sse2.MultiplyAddAdjacent(raw.AsInt16(), mul0);
        Vector128<int> mul1 = Vector128.Create(1000000, 10000, 100, 1);
        res = Sse41.MultiplyLow(res, mul1);
        res = Ssse3.HorizontalAdd(res, res);
        res = Ssse3.HorizontalAdd(res, res);
        return (uint)res.GetElement(0);
    }
}

Again, @PeterCordes doesn't allow me to write slow code. The following version has two improvements. The string is now loaded so that its last character lands at the end of the vector, and a mask loaded at the same offset is subtracted from it: the junk elements in front of the digits see 0xFFFF and saturate to zero, while the digit elements see '0'. This avoids the slow fallback for ShiftLeftLogical128BitLane with a variable count.
The second improvement is replacing vphaddd with pshufd + paddd.

// Note that this loads up to 14 bytes before the data part of the string.  (Or 16 for an empty string)
// This might or might not make it possible to read from an unmapped page and fault, beware.
public unsafe uint ParseUint4(string text)
{
    const string mask = "\xffff\xffff\xffff\xffff\xffff\xffff\xffff\xffff00000000";
    fixed (char* c = text, m = mask)
    {
        Vector128<ushort> raw = Sse3.LoadDquVector128((ushort*)c - 8 + text.Length);
        Vector128<ushort> mask0 = Sse3.LoadDquVector128((ushort*)m + text.Length);
        raw = Sse2.SubtractSaturate(raw, mask0);
        Vector128<short> mul0 = Vector128.Create(10, 1, 10, 1, 10, 1, 10, 1);
        Vector128<int> res = Sse2.MultiplyAddAdjacent(raw.AsInt16(), mul0);
        Vector128<int> mul1 = Vector128.Create(1000000, 10000, 100, 1);
        res = Sse41.MultiplyLow(res, mul1);
        Vector128<int> shuf = Sse2.Shuffle(res, 0x1b); // 0 1 2 3 => 3 2 1 0
        res = Sse2.Add(shuf, res);
        shuf = Sse2.Shuffle(res, 0x41); // 0 1 2 3 => 1 0 0 1
        res = Sse2.Add(shuf, res);
        return (uint)res.GetElement(0);
    }
}

~Twice as fast as the initial solution. (o_O) At least on my Haswell i7.

aepot
  • Using `vphaddd` that way means there's room for further optimization with 2 fewer shuffle uops! [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/a/35270026) – Peter Cordes Mar 05 '21 at 21:24
  • SSSE3 `palignr` (https://www.felixcloutier.com/x86/palignr) needs its shift-count as an immediate (compile-time constant). If this compiles, it might only work when inlined for compile-time-constant strings unless C# is doing something weird. Sse2.ShiftLeftLogical128BitLane (`pslldq` https://www.felixcloutier.com/x86/pslldq) has the same limitation, though. `palignr` needs a vector of zeros as input and thus costs an extra instruction to set that up. It also has larger code-size (longer machine code) so it's pretty much strictly worse for this use-case. – Peter Cordes Mar 05 '21 at 21:28
  • Also, if you can use `Sse2.MultiplyAddAdjacent` (`pmaddwd`) like the question does, it's faster than `pmulld` (Sse41.MultiplyLow) on Haswell and later. (1 uop instead of 2). – Peter Cordes Mar 05 '21 at 21:29
  • @PeterCordes `unless C# is doing something weird` - it does. There's no intrinsic in the JIT output, just a method `call` instead. I was surprised by that. Thank you for clarifying this thing that looked suspicious to me. :) I'll look deeper and update the answer later, thanks. – aepot Mar 05 '21 at 21:35
  • @PeterCordes Applying your suggestions I'm moving to OP's code version. But mine is obviously faster than initial. Looks like in C++ world OP's code must be faster but not in C#. You may look at JIT asm [here](https://sharplab.io/), just paste the C# code. Also you're right in both notes for `palignr`. – aepot Mar 05 '21 at 22:12
  • I don't think it's C++ vs. C#. Possibly it's a factor of what CPU exactly you're using (Sandybridge vs. Skylake vs. Zen vs. something else). Also, I didn't claim that the OP's strategy of `(uint) (e1 * 10000 + e2);` was better than using SIMD horizontal sums, although it might be. But certainly 2x `pshufd` + 2x `paddd` should be faster than 2x `phaddd`, so you can implement your strategy (horizontal vector sum before GetElement(0)) faster with cheaper shuffles. – Peter Cordes Mar 06 '21 at 02:04
  • If we can safely do an unaligned load that *ends* at the end of the string, we can avoid the variable shift. (We'd need to load a mask, or a constant for SubtractSaturate, with the same offset, to make sure we zero earlier elements. Like from an array of `0xffff, 0xffff, 0xffff, ..., 0x0020, 0x0020, ...` with the same 8-length offset or something, so low elements from outside the actual string get zeroed by `psubusw`). Of course, if loading from before the string could touch an unmapped page, that's a showstopper. – Peter Cordes Mar 06 '21 at 02:10
  • @PeterCordes I really tried to load with a negative offset to the pointer. There are at least 4 bytes containing the string length there, and I loaded successfully even with a -16 offset. But I didn't realize how to zero the redundant words properly. `Blend` was successful but has the same `imm8` constant issue, with no difference in performance. Will try 2x pshufd + 2x paddd, thx! – aepot Mar 06 '21 at 06:18
  • Instead of using a fixed `digit0` constant, load *that* from a sliding window of `0xffff` and `'0'` bytes. (I earlier wrote `0x20` but that was a mistake). Like in [Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all](https://stackoverflow.com/q/34306933) but instead of loading an AND mask, you load a constant for `psubusw` that will zero the elements you want (because saturating unsigned sub can do that when you subtract 0xffff) – Peter Cordes Mar 06 '21 at 06:33
  • @PeterCordes implemented, it's really fast now. Great thanks! – aepot Mar 06 '21 at 08:06
  • Another micro-optimization: After the saturated subtraction, use `packuswb` to pack the numbers into the lower half, then do a `pmaddubsw` followed by `pmaddwd` and just one shuffle and addition for the final reduction. And if you wanted to convert a sequence of strings you could fuse the reduction steps of up to four reductions, of course. – chtz Mar 06 '21 at 09:04
  • @chtz I've tried that in different ways. It isn't faster. – aepot Mar 06 '21 at 11:39

C# (thanks @aepot)

public unsafe uint ParseUint(string text)
{
    fixed (char* c = text)
    {
        Vector128<byte> mul1 = Vector128.Create(0x14C814C8, 0x010A0A64, 0, 0).AsByte();
        Vector128<short> mul2 = Vector128.Create(0x00FA61A8, 0x0001000A, 0, 0).AsInt16();
        Vector128<long> shift_amount = Sse2.ConvertScalarToVector128Int32((8 - text.Length) << 3).AsInt64();

        Vector128<short> vs = Sse2.LoadVector128((short*)c);
        Vector128<byte> vb = Sse2.PackUnsignedSaturate(vs, vs);
        vb = Sse2.SubtractSaturate(vb, Vector128.Create((byte)'0'));
        vb = Sse2.ShiftLeftLogical(vb.AsInt64(), shift_amount).AsByte();

        Vector128<int> v = Sse2.MultiplyAddAdjacent(Ssse3.MultiplyAddAdjacent(mul1, vb.AsSByte()), mul2);
        v = Sse2.Add(Sse2.Add(v, v), Sse2.Shuffle(v, 1));
        return (uint)v.GetElement(0);
    }
}

C solution using SSSE3:

#include <uchar.h> // char16_t
#include <tmmintrin.h> // pmaddubsw

unsigned ParseUint(char16_t* ptr, size_t len) {
    const __m128i mul1 = _mm_set_epi32(0, 0, 0x010A0A64, 0x14C814C8);
    const __m128i mul2 = _mm_set_epi32(0, 0, 0x0001000A, 0x00FA61A8);
    const __m128i shift_amount = _mm_cvtsi32_si128((8 - len) * 8);

    __m128i v = _mm_loadu_si128((__m128i*)ptr); // unsafe chunking
    v = _mm_packus_epi16(v,v); // convert digits from UTF16-LE to ASCII
    v = _mm_subs_epu8(v, _mm_set1_epi8('0'));
    v = _mm_sll_epi64(v, shift_amount); // shift off non-digit trash

    // convert
    v = _mm_madd_epi16(_mm_maddubs_epi16(mul1, v), mul2);
    v = _mm_add_epi32(_mm_add_epi32(v,v), _mm_shuffle_epi32(v, 1));
    
    return (unsigned)_mm_cvtsi128_si32(v);
}

Regardless of how one shifts/aligns the string (see aepot's answer), we want to stay away from pmulld. SSE basically only has 16-bit integer multiplication; the 32-bit multiply has double the latency and uops. However, care must be taken around the sign-extension behavior of pmaddubsw and pmaddwd.
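
As a minimal illustration of that rule (the weights and digit values below are made up for the example, not the constants from the code above): `pmaddubsw` treats its first operand as unsigned bytes and its second as signed bytes, so the digits (always 0..9) belong in the signed slot and the large weights in the unsigned slot.

Vector128<byte>  weights = Vector128.Create((byte)200, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
Vector128<sbyte> digits  = Vector128.Create((sbyte)1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);

// Element 0 = 200*1 + 20*2 = 240; correct because 200 is read as unsigned.
Vector128<short> sums = Ssse3.MultiplyAddAdjacent(weights, digits);

// Swapping the operands would read the weight 200 as the signed byte -56
// and corrupt the sums.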


using scalar x64:

// untested && I don't know C#
public unsafe static uint ParseUint(string text)
{
  fixed (char* c = text)
  {
    var xmm = Sse2.LoadVector128((short*)c); // unsafe chunking
    var packed = Sse2.PackSignedSaturate(xmm, xmm); // convert digits from UTF16-LE to ASCII
    ulong val = Sse2.X64.ConvertToUInt64(packed.AsUInt64()); // extract low 64 bits to scalar

    val -= 0x3030303030303030; // subtract '0' from each digit
    val <<= ((8 - text.Length) * 8); // shift off non-digit trash

    // convert
    const ulong mask = 0x000000FF000000FF;
    const ulong mul1 = 0x000F424000000064; // 100 + (1000000ULL << 32)
    const ulong mul2 = 0x0000271000000001; // 1 + (10000ULL << 32)
    val = (val * 10) + (val >> 8);
    val = (((val & mask) * mul1) + (((val >> 16) & mask) * mul2)) >> 32;
    return (uint)val;
  }
}
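
For intuition, here is a worked trace of the mask/multiply steps (my own example, assuming the 8-digit input "12345678", so that after the subtraction `val` holds the bytes 1..8 with the leading digit in the lowest byte):

// val = (val * 10) + (val >> 8)   -> each byte now holds 10*d[i] + d[i+1];
//                                     the bytes that matter are 12, 34, 56, 78
// (val & mask) * mul1             -> upper dword gains 12*1000000 + 56*100 = 12005600
// ((val >> 16) & mask) * mul2     -> upper dword gains 34*10000  + 78     =   340078
// sum >> 32                       -> 12005600 + 340078 = 12345678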

What if we don't know the length of the number ahead of time?

// C pseudocode & assumes ascii text
uint64_t v, m, len;
v = unaligned_load_little_endian_u64(p);
m = v + 0x4646464646464646; // roll '9' to 0x7F
v -= 0x3030303030303030; // unpacked binary coded decimal
m = (m | v) & 0x8080808080808080; // detect first non-digit
m = _tzcnt_u64(m >> 7); // count run of digits
if (((uint8_t)v) > 9) return error_not_a_number;
v <<= 64 - m; // shift off any "trailing" chars that are not digits
p += m >> 3; // consume bytes
v = parse_8_chars(v);

Or if we have a list of strings to process:

// assumes ascii text
__m256i parse_uint_x4(void* base_addr, __m256i offsets_64x4)
{
    const __m256i x00 = _mm256_setzero_si256();
    const __m256i x0A = _mm256_set1_epi8(0x0A);
    const __m256i x30 = _mm256_set1_epi8(0x30);
    const __m256i x08 = _mm256_and_si256(_mm256_srli_epi32(x30, 1), x0A);
    const __m256i mul1 = _mm256_set1_epi64x(0x010A0A6414C814C8);
    const __m256i mul2 = _mm256_set1_epi64x(0x0001000A00FA61A8);
    __m256i v, m;

    // process 4 strings at once, up to 8 digits in each string...
    // (the 64-bit chunks could be manually loaded using 3 shuffles)
    v = _mm256_i64gather_epi64((long long*)base_addr, offsets_64x4, 1);

    // rebase digits from 0x30..0x39 to 0x00..0x09
    v = _mm256_xor_si256(v, x30);

    // range check
    // (unsigned gte compare)
    v = _mm256_min_epu8(v, x0A);
    m = _mm256_cmpeq_epi8(x0A, v);

    // mask of lowest non-digit and above
    m = _mm256_or_si256(m, _mm256_sub_epi64(x00, m));

    // align the end of the digit-string to the top of the u64 lane
    // (shift off masked bytes and insert leading zeros)
    m = _mm256_sad_epu8(_mm256_and_si256(m, x08), x00);
    v = _mm256_sllv_epi64(v, m);
    
    // convert to binary
    // (the `add(v,v)` allow us to keep `mul2` unsigned)
    v = _mm256_madd_epi16(_mm256_maddubs_epi16(mul1, v), mul2);
    v = _mm256_add_epi32(_mm256_shuffle_epi32(v, 0x31), _mm256_add_epi32(v,v)); 

    // zero the hi-dwords of each qword
    v = _mm256_blend_epi32(v, x00, 0xAA);

    return v;
}
aqrit
  • I've added C# translation for SSSE3. The fastest solution. – aepot Mar 07 '21 at 09:31
  • Doing `_mm_subs_epu8` or `epu16` first, (or at least before shifting) would create more ILP / hide more latency of the shift-count calculation and of `movd` from int to XMM. Sometimes latency from shift-count to retval won't be significant, but sometimes it might be. And it won't *hurt* throughput. – Peter Cordes Mar 07 '21 at 09:59
  • Also, `_mm_sll_epi64` by a vector is actually slower on Skylake (2 uops) than AVX2 `_mm_sllv_epi64` (1 uop). (Also allows different lengths if you packed 2 different strings into one 16-byte vector.) But the situation is reversed on Haswell: legacy shift-by-vector (count from low elem) is 1 uop, vs. 3 for variable. Zen is 1 uop for both, but sllv is higher lat, not fully pipelined(?). https://uops.info/table.html?search=psll%20q%20%22%2C%20xmm%2C%20xmm%22&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_CFL=on&cb_ZEN%2B=on&cb_ZEN2=on&cb_measurements=on&cb_avx=on&cb_avx2=on – Peter Cordes Mar 07 '21 at 10:10
  • Correction, on HSW, `_mm_sllv_epi64` is only 2 uops. Agner reports 3, but https://uops.info/table.html?search=sll%20q%20%22xmm%2C%20xmm)%22&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_HSW=on&cb_CFL=on&cb_ICL=on&cb_ZEN%2B=on&cb_ZEN2=on&cb_measurements=on&cb_avx=on&cb_avx2=on measures 2, and I trust their automated testing more. (Link has the table filtered for these 2 insns, for HSW, CFL, ICL, Zen1, Zen2) – Peter Cordes Mar 07 '21 at 10:17
  • IMO you didn't need to make this community wiki; it's hard enough to gain rep in low-traffic asm / SIMD tags that you deserve the credit for your part of writing and editing this answer. Too late to undo it now, but in future don't feel like you need to do that unless you want to disown the answer and not take responsibility for its future contents. – Peter Cordes Mar 07 '21 at 16:21
  • Fun fact: `pmulld` used to be single-uop, on Intel before Haswell (e.g. Sandybridge). For AVX2, Intel changed to just sharing the mantissa multipliers of one (or in SKL both) FMA units (which are only 24 bits wide per 32-bit element), because having special 32-bit integer multiply hardware gets expensive for wider vectors. (16-bit integers are more common in DSP stuff, and cheaper.) – Peter Cordes Mar 07 '21 at 16:25
  • according to Agner's tables `pmulld` vs `pmaddwd`: Wolfdale 4:1, Nehalem 2:1, Silvermont 7:1 – aqrit Mar 07 '21 at 16:47
  • Really interesting ideas here. What if I want to check if a string contains 8 consecutive digits and then compute the result (or return false)? – Carl Verret Oct 05 '21 at 12:24
  • @CarlVerret, I added some hints to the answer. – aqrit Oct 06 '21 at 20:15
  • Checking for 8 consecutive digits using SSE2 is just a range check followed by a `pmovmskb` – aqrit Oct 06 '21 at 20:27
  • @CarlVerret you've deleted my code out of your FastFloat library :-). https://github.com/fastfloat/fast_float/pull/28 & https://github.com/simdjson/simdjson/issues/1019 – aqrit Oct 06 '21 at 20:56
  • @aqrit : Sorry, I don't recall doing so. My repo (csFastFloat) is a C# port of Lemire's fast_float. I also made public TestSIMD, my test project / benchmark for SIMD optimisations. Given the valuable information in this Q&A, I've been able to put some working code together that seems to be quick. – Carl Verret Oct 07 '21 at 00:55

First of all, 5x improvement is not “rather unimpressive”.

I would not do the last step with scalar code; here's an alternative:

// _mm_shuffle_epi32( x, _MM_SHUFFLE( 3, 3, 2, 2 ) )
collapsed3 = Sse2.Shuffle( collapsed3, 0xFA );
// _mm_mul_epu32
var collapsed4 = Sse2.Multiply( collapsed3.As<int, uint>(), Vector128.Create( 10000u, 0, 1, 0 ) ).As<ulong, uint>();
// _mm_add_epi32( x, _mm_srli_si128( x, 8 ) )
collapsed4 = Sse2.Add( collapsed4, Sse2.ShiftRightLogical128BitLane( collapsed4, 8 ) );
return collapsed4.GetElement( 0 );

The C++ version is going to be way faster than what happens on my PC (.NET Core 3.1). The generated code is not good. The JIT initializes the constants like this:

00007FFAD10B11B6  xor         ecx,ecx  
00007FFAD10B11B8  mov         dword ptr [rsp+20h],ecx  
00007FFAD10B11BC  mov         dword ptr [rsp+28h],64h  
00007FFAD10B11C4  mov         dword ptr [rsp+30h],1  
00007FFAD10B11CC  mov         dword ptr [rsp+38h],64h  
00007FFAD10B11D4  mov         dword ptr [rsp+40h],1  

It uses stack memory instead of another vector register. It looks like the JIT developers forgot there are 16 vector registers; the complete function only uses xmm0.

00007FFAD10B1230  vmovapd     xmmword ptr [rbp-0C0h],xmm0  
00007FFAD10B1238  vmovapd     xmm0,xmmword ptr [rbp-0C0h]  
00007FFAD10B1240  vpsrldq     xmm0,xmm0,8  
00007FFAD10B1245  vpaddd      xmm0,xmm0,xmmword ptr [rbp-0C0h]  
Soonts