
For a hobby project I'm working on, I need to emulate certain 64-bit integer operations on an x86 CPU, and it needs to be fast.

Currently, I'm doing this via MMX instructions, but that's really a pain to work with, because I have to flush the fp register state all the time (and because most MMX instructions deal with signed integers, and I need unsigned behavior).

So I'm wondering if the SSE/optimization gurus here on SO can come up with a better implementation using SSE.

The operations I need are the following (quite specific) ones:

uint64_t X, Y;

X = 0;
X = 1;
X << 1;
X != Y;
X + 1;
X & 0x1 // get lsb
X | 0x1 // set lsb
X > Y;

Specifically, I don't need general-purpose addition or shifting, for example, just add one and left-shift one. Really, just the exact operations shown here.

On 32-bit x86, of course, uint64_t is emulated using two 32-bit scalars, which is slow (and, in my case, simply doesn't work, because I need loads/stores to be atomic, which they won't be when loading/storing two separate registers).

Hence, I need a SIMD solution. Some of these operations are trivial, supported by SSE2 already. Others (!= and <) require a bit more work.

Suggestions? SSE and SSE2 are fine. It'd take some persuasion to permit SSE3, and SSE4 is probably out of the question (a CPU which supports SSE4 is likely to be running 64-bit code anyway, so I wouldn't need these workarounds).

phuclv
jalf
  • 64-bit integer addition is directly supported in SSE2. I assume you also need 64-bit multiplies? 64 x 64 -> 64-bit (lower half), or do you need 64 x 64 -> 128-bit? – Mysticial Apr 19 '12 at 09:21
  • Multiply isn't needed, Just the specific ops I showed above (so not even general addition, just increment by 1. And yeah, addition is provided by SSE2, but I figured I might as well just show all the operations I needed, for completeness. Just means some of them are easy :) – jalf Apr 19 '12 at 09:24
  • What do you want the logical operators to output to. General register? or SSE? – Mysticial Apr 19 '12 at 09:39
  • 1
    If you are using a CPU which isn't able to do 64 bit but has support for SSE2, this would be an Athlon XP, a Pentium III or an older Pentium IV. In the case of an Athlon XP I won't expect any performance gain at all, because it does split every SSE operation in two 64 bit operations, which are then executed separatly. For a Pentium III - well I don't know. For a Pentium IV you may be able to get some speed up - depends on how often transfers from and to general purpose regsiters come into play, because these are notoriously slow on this hardware. – Gunther Piez Apr 19 '12 at 09:39
  • @drhirsch, people still use 32bit OS's though, so all that 64bit hardware is nice but you often can't use it. – harold Apr 19 '12 at 09:42
  • @harold The OP writes that if he had a cpu which were able to run 64 bit he wouldn't need these workarounds, so I assume he doesn't have one. So he is stuck on ancient hardware. – Gunther Piez Apr 19 '12 at 09:45
  • So the assumption you gave in your question is wrong: You can run a 32 bit OS on a modern CPU and have access to all SSE up to 4.2. – Gunther Piez Apr 19 '12 at 09:48
  • @Harold Sure, SSE3, SSSE3, SSE4.1, SSE4.2 all run perfectly fine on a 32 bit OS, both Linux and Windows. – Gunther Piez Apr 19 '12 at 09:50
  • As far as I can tell, SSE4 doesn't help much. If you're only doing scalar operations, then `_mm_testz_si128()` and `_mm_testnzc_si128` will save one instruction. – Mysticial Apr 19 '12 at 09:55
  • 1
    @drhirsch I have no clue what point you're trying to make. Are you just nitpicking out of boredom? Yes, I know that the OS doesn't limit which SSE instruction sets are available. And my own machine is an i7 running on a 64-bit OS. But I'd like my code to work on other computers too, including ones that are, whether due to the OS or the CPU, limited to 32-bit code. And relying on, say, SSE4.2 would cut off most 32-bit computers. Relying on SSE2 would cover nearly them all. Now, do you have anything *relevant* to contribute? – jalf Apr 19 '12 at 09:55
  • 1
    Why didn't you write that in this way your question? As it stands it sounds like you need 64 bit operations on a CPU which isn't _able_ to run in 64 bit mode - something old. – Gunther Piez Apr 19 '12 at 10:04
  • @Mysticial: the output of the logical operators should ideally be placed in a general purpose register. – jalf Apr 19 '12 at 10:38
  • *and because most MMX instructions deal with signed integers, and I need unsigned behavior* Huh? You might be talking about Intel's *C intrinsics* for MMX instructions that return signed `int` or `__int64` types. MMX doesn't have arithmetic right shift for 64-bit integers, but it *does* have [left/right logical shift](https://www.felixcloutier.com/x86/psrlw:psrld:psrlq). Only `paddsw` (16-bit add with signed saturation) and some pack shuffles are explicitly signed. `pcmpgtq` (64-bit compare signed greater-than) is new in SSE4.2, and not available on MMX registers. – Peter Cordes Jun 06 '19 at 07:30

1 Answer


SSE2 has direct support for some 64-bit integer operations:

Set both elements to 0:

__m128i z = _mm_setzero_si128();

Set both elements to 1:

__m128i z = _mm_set1_epi64x(1);      // also works for variables.
__m128i z = _mm_set_epi64x(hi, lo);  // elements can be different

__m128i z = _mm_set_epi32(0,1,0,1);  // if any compilers refuse int64_t in 32-bit mode.  (None of the major ones do.)

Set/load the low 64 bits, zero-extending to __m128i

// supported even in 32-bit mode, and listed as an intrinsic for MOVQ
// so it should be atomic on aligned integers.
_mm_loadl_epi64((const __m128i*)p);     // movq or movsd 64-bit load

_mm_cvtsi64x_si128(a);      // only ICC, others refuse in 32-bit mode
_mm_loadl_epi64((const __m128i*)&a);  // portable for a value instead of pointer

Things based on _mm_set_epi32 can get compiled into a mess by some compilers, so _mm_loadl_epi64 appears to be the best bet across MSVC and ICC as well as gcc/clang, and should actually be safe for your requirement of atomic 64-bit loads in 32-bit mode. See it on the Godbolt compiler explorer

Vertically add/subtract each 64-bit integer:

__m128i z = _mm_add_epi64(x,y)
__m128i z = _mm_sub_epi64(x,y)

http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse2_integer_arithmetic.htm#intref_sse2_integer_arithmetic

Left Shift:

__m128i z = _mm_slli_epi64(x,i)   // i must be an immediate

http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse2_int_shift.htm

Bitwise operators:

__m128i z = _mm_and_si128(x,y)
__m128i z = _mm_or_si128(x,y)

http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse2_integer_logical.htm

SSE has no increment instruction, so you'll have to keep a constant holding 1 in a register and add it.


Comparisons are harder since there's no 64-bit integer compare until SSE4.1 (pcmpeqq for equality) and SSE4.2 (pcmpgtq for signed greater-than).

Here's the one for equality:

__m128i t = _mm_cmpeq_epi32(a,b);                      // 32-bit lane-wise equality
__m128i z = _mm_and_si128(t,_mm_shuffle_epi32(t,177)); // 177 = _MM_SHUFFLE(2,3,0,1): both 32-bit halves of each 64-bit element must match

This will set each 64-bit element to 0xffffffffffffffff (aka -1) if they are equal. If you want it as a 0 or 1 in an int, you can pull it out using _mm_cvtsi128_si32() and mask with & 1. (But sometimes you can do total -= cmp_result; instead of converting and masking.)

And Less-Than: (not fully tested)

a = _mm_xor_si128(a,_mm_set1_epi32(0x80000000));  // flip sign bits so the signed
b = _mm_xor_si128(b,_mm_set1_epi32(0x80000000));  //   32-bit compares act as unsigned
__m128i t = _mm_cmplt_epi32(a,b);                 // 32-bit lane-wise a < b
__m128i u = _mm_cmpgt_epi32(a,b);                 // 32-bit lane-wise a > b
__m128i z = _mm_or_si128(t,_mm_shuffle_epi32(t,177));  // low<low || high<high
z = _mm_andnot_si128(_mm_shuffle_epi32(u,245),z); // 245 = _MM_SHUFFLE(3,3,1,1): clear the result when high>high

This will set each 64-bit element to 0xffffffffffffffff (aka -1) if the corresponding element in a is less than the one in b.


Here are versions of "equals" and "less-than" that return a bool. They return the result of the comparison for the bottom 64-bit integer.

inline bool equals(__m128i a,__m128i b){
    __m128i t = _mm_cmpeq_epi32(a,b);
    __m128i z = _mm_and_si128(t,_mm_shuffle_epi32(t,177));
    return _mm_cvtsi128_si32(z) & 1;
}
inline bool lessthan(__m128i a,__m128i b){
    a = _mm_xor_si128(a,_mm_set1_epi32(0x80000000));
    b = _mm_xor_si128(b,_mm_set1_epi32(0x80000000));
    __m128i t = _mm_cmplt_epi32(a,b);
    __m128i u = _mm_cmpgt_epi32(a,b);
    __m128i z = _mm_or_si128(t,_mm_shuffle_epi32(t,177));
    z = _mm_andnot_si128(_mm_shuffle_epi32(u,245),z);
    return _mm_cvtsi128_si32(z) & 1;
}
Peter Cordes
Mysticial
  • I just updated the "less-than" code. Not 100% sure if it's correct though. – Mysticial Apr 19 '12 at 10:11
  • I wrote a quick test of the "less than" case [here](http://ideone.com/U4LXd) for VS2010. Might need a bit of tweaking to run on other compilers. It fails for `0x00000000 < 0x00000001` (unless I made a mistake in the test) – jalf Apr 19 '12 at 11:46
  • It works for me. Your link is showing that it isn't even compiling. Or am I missing something? – Mysticial Apr 19 '12 at 15:36
  • @jalf, I found what was wrong with your test. You need `z.m128i_u32[0] == 0xffffffffull` instead of `z.m128i_u64[0] == 0xffffffffull`. (32-bit instead of 64-bit) But anyways, I've updated my answer with functions that return `bool`s. – Mysticial Apr 19 '12 at 15:53
  • I just made another change to the less-than code. Apparently the comparison operators assume the input is signed. So I had to insert a couple `xor`s to invert the upper-most bits to fix this. At this point, it looks pretty bad, so it might be better to do it with `uint64_t` and see what the compiler generates. – Mysticial Apr 19 '12 at 18:15
  • Hehe yeah, it's starting to look ugly (as expected). However, I already have these values in SSE registers (loads/stores have to be atomic, which means I can't just load a pair of 32-bit uints), and so switchin to `uint64_t` would involve additional conversions. Still, might be worth it. Thanks for trying though. I think this is about as long as my MMX attempt, so it does seem like a net improvement :) – jalf Apr 19 '12 at 19:53
  • @Mysticial For 64bits signed less-than wouldn't be `_mm_shuffle_epi32(_mm_srai_epi32(_mm_sub_epi64(a, b), 31), _MM_SHUFFLE(3, 3, 1, 1));` easier and much more efficient? If `a-b` is negative (the sign bit is `1`), then `a` is less than `b`. – plasmacel Dec 29 '16 at 09:43
  • @plasmacel Wouldn't that fail when `a-b` overflows? As a 32-bit example: `a=-2000000000`, `b=2000000000`, `a-b=+294967296`. That said, it's still a valid approach if you know it will never overflow. – Mysticial Dec 29 '16 at 19:32
  • @Mysticial In that case just XOR `__m128i d = _mm_sub_epi64(a, b)` with an overflow detector `_mm_and_si128(_mm_xor_si128(a, b), _mm_xor_si128(d, a))`, then populate the signbit to create 0 or -1. – plasmacel Dec 30 '16 at 01:16