
I need to port a Xorshift algorithm from scalar to vector code (SSE/SIMD version built with -march=nocona). I'm using the uint32_t version of the algorithm (taken directly from wiki):

#include <stdint.h>

struct xorshift32_state {
  uint32_t a;
};

/* The state word must be initialized to non-zero */
uint32_t xorshift32(struct xorshift32_state *state)
{
    /* Algorithm "xor" from p. 4 of Marsaglia, "Xorshift RNGs" */
    uint32_t x = state->a;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return state->a = x;
}

The main problems are:

  1. it uses uint32_t, so (per the standard) it wraps around automatically
  2. due to SSE3 "limits", I would like to stay with __m128i (which is signed, and gives me all the operations I need, I believe)
  3. signed overflow is undefined behaviour in the C++ standard

How would you manage this port using SIMD? By dealing with epu32 and subtracting half the uint32 max (and then adding it back)?
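For what it's worth, the shift/XOR steps map directly onto SSE2 integer intrinsics: `_mm_slli_epi32`, `_mm_srli_epi32` and `_mm_xor_si128` all operate on the raw 32-bit lanes with well-defined wrap-around, so no signedness workaround is needed. A minimal sketch of a 4-lane port (the struct and function names here are my own, not from any library):

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical 4-lane state: four independent xorshift32 streams.
   Every lane must be seeded non-zero (and with distinct seeds,
   or all four lanes produce the same sequence). */
struct xorshift32x4_state {
    __m128i a;
};

/* Runs the scalar algorithm on all four lanes at once.
   _mm_srli_epi32 is the *logical* right shift, matching
   what >> does on uint32_t in the scalar version. */
static inline __m128i xorshift32x4(struct xorshift32x4_state *state)
{
    __m128i x = state->a;
    x = _mm_xor_si128(x, _mm_slli_epi32(x, 13)); /* x ^= x << 13 */
    x = _mm_xor_si128(x, _mm_srli_epi32(x, 17)); /* x ^= x >> 17 */
    x = _mm_xor_si128(x, _mm_slli_epi32(x, 5));  /* x ^= x <<  5 */
    return state->a = x;
}
```

Each lane then behaves exactly like an independent scalar xorshift32; no epu32 tricks or offset-subtraction are needed, because none of these operations care about the sign bit.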

Peter Cordes
markzzz
  • The C++ Standard has nothing to say about SIMD. So, "signed overflow is UB" is of no concern as long as the overflow happens during a SIMD operation. – j6t Apr 23 '21 at 07:21
  • *m128i (which is signed* - no it isn't, it's just a bag of bytes. It's up to you to use arithmetic vs. logical right shifts, whichever is more useful for your algorithm. `_mm_srli_epi64` vs. `_mm_srai_epi32` – Peter Cordes Apr 23 '21 at 07:57
  • @PeterCordes what do you mean by "logical right"? At some point, a uint32_t would have to be stored into an int32_t, which means that some truncation will occur, no? – markzzz Apr 23 '21 at 08:56
  • *at some point, a uint32_t should be stored into a int32_t* - no. `__m128i` is just a collection of 16 bytes, 128 bits, in a defined order. `_mm_srli_epi32` mutates it in one specific way (like `>>` on `uint32_t`), `_mm_srai_epi32` mutates it in a different way (like `>>` on `int32_t`), while `_mm_add_epi32` mutates it in yet another way (like `+` on `uint32_t`, binary addition with well-defined wrap-around). The only way to access the contents is to extract or store them into a C object of some type. – Peter Cordes Apr 23 '21 at 09:04
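A tiny demo of that point (the helper names below are invented for illustration): the same bit-pattern comes out differently depending only on which shift intrinsic you pick.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Same bit-pattern, two meanings of ">> 1": logical shift
   (zero-fill, like uint32_t) vs. arithmetic shift (sign-bit copy,
   like int32_t). Helper names are illustrative only. */
static uint32_t shr1_logical(uint32_t x)
{
    __m128i v = _mm_cvtsi32_si128((int)x);
    return (uint32_t)_mm_cvtsi128_si32(_mm_srli_epi32(v, 1));
}

static uint32_t shr1_arith(uint32_t x)
{
    __m128i v = _mm_cvtsi32_si128((int)x);
    return (uint32_t)_mm_cvtsi128_si32(_mm_srai_epi32(v, 1));
}

/* shr1_logical(0x80000000u) -> 0x40000000 (high bit shifted out, 0 shifted in)
   shr1_arith(0x80000000u)   -> 0xC0000000 (sign bit duplicated)            */
```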
  • Don't let the `epi` signed naming fool you; Intel picks `epi` instead of `epu` for cases where signed and unsigned do the exact same binary operation, and bizarrely even for logical right shift (shifting in zeros instead of copies of the sign bit, like `uint32_t` `>>` - [The difference between logical shift right, arithmetic shift right, and rotate right](https://stackoverflow.com/q/44694957)) – Peter Cordes Apr 23 '21 at 09:07
  • Also [What are bitwise shift (bit-shift) operators and how do they work?](https://stackoverflow.com/q/141525) – Peter Cordes Apr 23 '21 at 09:13
  • @PeterCordes I think I'm starting to see. Is it correct to say that __m128i works more like 4 x uint32 than 4 x int32 (since all the bits represent the number, and it wraps around automatically)? Such that `11111111 11111111 11111111 11111111 + 00000000 00000000 00000000 00000001` = `std::numeric_limits::max() + 1` without any further operation? – markzzz Apr 23 '21 at 12:24
  • Yes, *if* you use `_mm_add_epi32`. If you use `_mm_add_epi64` then it acts like 2x uint64_t. Also note that 2's complement wrapping addition and unsigned binary addition are the same operation (which SSE2 adds implement). Signed wraparound is from `0x7f` to `0x80` (for `_mm_add_epi8`). Or if you use `_mm_adds_epi8` (saturating addition), `0x7f + 1 = 0x7f` (https://www.felixcloutier.com/x86/paddsb:paddsw), vs. with unsigned-saturating addition (https://www.felixcloutier.com/x86/paddusb:paddusw), `0xff + 1 = 0xff` instead of wrapping to zero. (Useful for audio or video.) – Peter Cordes Apr 23 '21 at 12:36
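That difference is easy to see with single-byte lanes (the helper names here are invented, not a real API): the plain binary add wraps, while the signed-saturating and unsigned-saturating variants clamp.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Lane 0 of an 8-bit add done three ways. */
static uint8_t add8_wrap(uint8_t a, uint8_t b)   /* PADDB: binary add, wraps */
{
    __m128i r = _mm_add_epi8(_mm_set1_epi8((char)a), _mm_set1_epi8((char)b));
    return (uint8_t)_mm_cvtsi128_si32(r);
}
static uint8_t add8_sat_i(uint8_t a, uint8_t b)  /* PADDSB: signed saturating */
{
    __m128i r = _mm_adds_epi8(_mm_set1_epi8((char)a), _mm_set1_epi8((char)b));
    return (uint8_t)_mm_cvtsi128_si32(r);
}
static uint8_t add8_sat_u(uint8_t a, uint8_t b)  /* PADDUSB: unsigned saturating */
{
    __m128i r = _mm_adds_epu8(_mm_set1_epi8((char)a), _mm_set1_epi8((char)b));
    return (uint8_t)_mm_cvtsi128_si32(r);
}

/* add8_wrap(0xff, 1)  -> 0x00  (binary wrap-around)
   add8_sat_i(0x7f, 1) -> 0x7f  (signed saturation at +127)
   add8_sat_u(0xff, 1) -> 0xff  (unsigned saturation at 255) */
```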
  • @PeterCordes: ok, so let's say I did the operations above (shift and xor), keeping the value in `m128i` (which works out of the box). At some point, I need to normalize the result, scaling the range to (for example) [`0.0, 1.0`]. Once I do that (i.e. `x / std::numeric_limits::max()`), which intrinsic should I use to convert `m128i` to `m128` while treating x as `uint` instead of `int`? It seems that `_mm_cvtepi32_ps` converts `packed signed 32-bit integers`, but that's not what I want: it must treat them as `unsigned`. (Feel free to ask me to open a new question, if that's the right place.) Thanks – markzzz Apr 23 '21 at 12:50
  • Right, x86 doesn't have unsigned<->float until AVX-512. For scalar uint32_t you can use scalar int64_t, but that's not an option for packed. [Related](https://stackoverflow.com/q/64592329). But IIRC, there are other ways to use random bits to generate random floats in the [0, 1.0] range involving integer stuff to create FP bit patterns, so google for that. Note that `float` can't exactly represent every `uint32_t` anyway; above 2^24 it rounds to a multiple of a power of 2, so dividing by 0xFFFFFFFF isn't possible. (That divisor itself would round up to 2^32 as a float) – Peter Cordes Apr 23 '21 at 12:59
  • Related: https://lemire.me/blog/2017/02/28/how-many-floating-point-numbers-are-in-the-interval-01/ and this http://mumble.net/~campbell/2014/04/28/uniform-random-float (haven't fully read the latter yet to see about implementing with SIMD, or what exactly it's saying about your simple division method.) – Peter Cordes Apr 23 '21 at 13:05
  • Ah right, found [How to perform uint32/float conversion with SSE?](https://stackoverflow.com/q/34066228) - it takes some work, usually including 2 conversions and an add. Or see [Strange uint32\_t to float array conversion](https://stackoverflow.com/a/40201556) for how compilers auto-vectorize it: clang's unsigned->float just stuffs bits into float bit-patterns and adds the parts. – Peter Cordes Apr 23 '21 at 13:07
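The two-conversions-and-an-add trick from those links can be sketched like this (the function name is mine): split each 32-bit lane into 16-bit halves, convert each half exactly with `_mm_cvtepi32_ps` (both halves are non-negative and fit in 16 bits, well inside a float's mantissa), then recombine.

```c
#include <emmintrin.h>  /* SSE2 */

/* Sketch of an unsigned 32-bit -> float conversion: treat each
   lane as hi*65536 + lo. Both halves convert exactly, and
   hi * 65536.0f is an exact power-of-two scaling, so only the
   final add rounds. */
static __m128 u32x4_to_f32x4(__m128i v)
{
    __m128i lo = _mm_and_si128(v, _mm_set1_epi32(0xFFFF));
    __m128i hi = _mm_srli_epi32(v, 16);
    return _mm_add_ps(_mm_mul_ps(_mm_cvtepi32_ps(hi), _mm_set1_ps(65536.0f)),
                      _mm_cvtepi32_ps(lo));
}
```

Values above 2^31 come out correctly because the halves never look negative to `_mm_cvtepi32_ps`; as noted above, values above 2^24 still round to the nearest representable float.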
  • @PeterCordes I could move the range from 0,1 to -1,1 by doing `- 2147483648` on the __m128i `x` counter, then dividing by `0x7fffffff` (which can be expressed correctly in float). It will lose some precision in the division of course, but that's ok for the audio processing task I'm doing. What do you think? – markzzz Apr 24 '21 at 08:38
  • Note that the `0, 1` range of floats only takes half the bit-patterns, and if you want a *uniform* distribution over that range, you also don't want a lot of very tiny floats spaced more closely than 1 epsilon. I think converting 0..2^24 (simply as int32_t) and multiplying by float `2^-24` would give you what you want, using all the mantissa bits. 0..2^31 might be worse, I haven't fully thought it through. Probably best to ask a separate question about SIMD float RNG if there isn't one already. There aren't enough bits in a float to be worth the extra trouble of converting 0..2^32-1 – Peter Cordes Apr 24 '21 at 08:52
  • @PeterCordes are you suggesting doing `& 0xFFFFFF` on each result, keeping the range within 24 bits? For what it's worth, the signal I want only needs 16 bits of precision (i.e. white noise) – markzzz Apr 24 '21 at 09:46
  • Yes, or logical right-shift by 8, to zero the high 8 bits. Or if you need to be able to get 1.0, (and want to avoid bias away from full-range), perhaps keep more than 24 bits so some values get rounded up on conversion, instead of every value being exactly representable. Or maybe adding 1 to the integer value before conversion? You can still get 0 when eventually converting to 16-bit for small numbers. You should really ask a new question about generating uniform random floats from 0..1.0 with SIMD, instead of just getting ideas off the top of my head that I haven't considered carefully. – Peter Cordes Apr 24 '21 at 09:53
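The shift-by-8 idea can be sketched as follows (the function name is invented): a logical right shift leaves each lane in 0..2^24-1, which `_mm_cvtepi32_ps` converts exactly (the values are non-negative and fit in the 24-bit mantissa), and scaling by 2^-24 (a power of two, so exact) lands in [0.0, 1.0).

```c
#include <emmintrin.h>  /* SSE2 */

/* Sketch: 32 random bits per lane -> uniform float in [0.0, 1.0).
   Dropping the low 8 bits keeps every intermediate value exactly
   representable, so the only non-uniformity is the one inherent
   in using 24 random bits. */
static __m128 rand_bits_to_unit_float(__m128i bits)
{
    __m128i hi24 = _mm_srli_epi32(bits, 8);          /* 0 .. 2^24-1 */
    return _mm_mul_ps(_mm_cvtepi32_ps(hi24),
                      _mm_set1_ps(1.0f / 16777216.0f)); /* * 2^-24 */
}
```

For a -1.0..1.0 audio range, multiplying by 2.0f and subtracting 1.0f afterwards would follow the same exact-power-of-two reasoning.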
  • Rounding for large integers would map e.g. 32 different integers to the same multiple of 32. I guess that just means they get generated more often, while very tiny numbers get generated less often, so maybe you would still be getting uniformity over the 0..1 range. (Which you wouldn't get by using the random integers as FP bit-patterns; that would generate an equal amount in every exponent range, thus exponentially more closer to 0.0). So yeah it's probably best to use at least 31 bits to have a chance to generate every positive finite float <= 1.0. – Peter Cordes Apr 24 '21 at 10:18
  • But seriously, just **ask a new question**. Hardly anybody's going to find these comments when looking for how to generate random floats with SIMD, so I'm not going to spend any more effort on it here. I'm not going to reply to any further comments about FP RNG in this comment thread. The question is interesting, and this is the wrong place to answer it. – Peter Cordes Apr 24 '21 at 10:19
  • @PeterCordes I opened this: https://stackoverflow.com/questions/67246102/seeded-random-uniform-float-generator-using-simd . Is the question written well? – markzzz Apr 24 '21 at 18:24

0 Answers