4

I got hold on an SUPER-FAST algorithm that generates an array of random bytes, uniformly. It's 6 times faster than c++ uniform distribution and mersenne-twister of std library.

The count of an array is divisible by 4, so it can be interpreted as array of integers. Casting each entry to an integer, produces values in the range [INT_MIN, INT_MAX]. But how can I transform these integer values to lie between my own [min, maximum]?

I want to avoid any if-else, to avoid branching.


Maybe I should apply some bitwise logic, to discard irrelevant bits in each number? (because all remaining, unmasked bits will be either 0 or 1 anyway). If I can extract the most significant bit in my maximum-value, I could mask any bits that are more significant than that one, in my integers.

For example, if I want my max to be 17, then it is 00010001 in binary form. Maybe my mask would then look as 00011111? I could then apply it to all numbers in my array.

But, this mask is wrong ...It actually allows values up to (1+2+4+8+16) :(

What can I do? Also, how to take care of the min?

Edit

I am generating millions of numbers every frame of my application, for neural networks. I managed to vectorize the code using AXV2 for float variants (using this post), but need to get integers working too.

Kari
  • 1,244
  • 1
  • 13
  • 27
  • What C++ version are you using? are the values guaranteed to be positive? – Tony Tannous Jan 04 '21 at 11:01
  • 1
    Since you are using C++, [why not using C++](https://en.cppreference.com/w/cpp/numeric/random/uniform_int_distribution)? – user1810087 Jan 04 '21 at 11:02
  • 7
    If `min` and `max` are not powers of 2, I guess bitmask are useless. – Evg Jan 04 '21 at 11:03
  • 6
    Generating a random number in `[min, max]` is equivalent to generating a random number in `[0,max-min]` and then adding `min`. And this reduces the problem to generating a number in `[0,max]`. If `max` is of the form `2^n-1` then you can cut off superfluous bits. But if it is not then there is no algorithm that generates a random number in `[0,max]` from `01` uniform generator such that the result is uniform and it has stop property. And also forget about avoiding branching. – freakish Jan 04 '21 at 11:21
  • 1
    Typical solution is `rand() % max` which is not uniform. Another is `do { result = rand(); } while (result > max)` which is uniform but does not have stop property. Even though the expect iterations count is something around 6 if I remember correctly. – freakish Jan 04 '21 at 11:22
  • @freakish `%max+1` in this case as the range `MAX_INT]` is included. – Tony Tannous Jan 04 '21 at 11:28
  • @TonyTannous true, true. Too late to edit. :( – freakish Jan 04 '21 at 11:28
  • To maintain uniformity one can treat a binary random stream as an input to arithmetic decoder -- every value can have a unique probability and no bits of randomness are wasted. – Aki Suihkonen Jan 04 '21 at 11:37
  • There is a class `std::mt19937` and a corresponding class template `std::mersenne_twister_engine`. Learn how they are used with `std::uniform_int_distribution`, and then replace the engine with your own [uniform random bit generator](https://en.cppreference.com/w/cpp/named_req/UniformRandomBitGenerator). – Dialecticus Jan 04 '21 at 12:25
  • @TonyTannous C++17. The algorithm that I've linked in my question is 6 times faster than 'mersenne-twister' and distributions of std library. – Kari Jan 04 '21 at 12:49
  • @Yunnosch I am generating millions of numbers every frame of my application, for neural networks. I managed to vectorize the code using AXV2 for float variants, but need to get integers working too. – Kari Jan 04 '21 at 12:49
  • Not insisting on bitmasks really made things much clearer to me. Thanks for the edit. – Yunnosch Jan 04 '21 at 13:41
  • What's the range of generated values? What's the range for min and max? Are min and max constant for a long period of time? – Sopel Jan 04 '21 at 13:56
  • Also, what's the desired instruction set? – Sopel Jan 04 '21 at 14:01
  • semi-related: [What's the fastest way to generate a 1 GB text file containing random digits?](https://unix.stackexchange.com/a/324520) uses a multiplicative inverse to do the mod-10 part for 16-bit integers. (With AVX2) – Peter Cordes Jan 04 '21 at 14:47

3 Answers3

3

But how can I transform these integer values to lie between my own [min, maximum]?

Since the range may not be a power of two, bitmasking is out, but you found that out already.

Modulo is also out, it does not exist as a native operation in AVX2 (and even if it did, that wouldn't necessarily make it efficient).

There is an other option: multiply-high, using _mm256_mul_epu32 (unfortunately there is no "pure" multiply-high for 32bit numbers, like there is for 16bit numbers, so we're stuck with an operation that only does 50% useful work). The idea there is to take the input number x (full range) and the desired range r, then compute r * x / 2^32 where the division is implicit (implemented by taking the high half of the product).

x / 2^32 would have been a number in [0.0 .. 1.0) (excluding 1.0) if it was interpreted as a rational number, multiplying by r then stretches the range to be [0.0 .. r) (excluding r). That's not how it's calculated, but that's where the formula comes from.

Setting the minimum of the range is handled easily by adding min to the scaled result.

In code (slightly tested):

__m256i squish(__m256i x, int min, int max) {
    __m256i sizeOfRange = _mm256_set1_epi32((unsigned)max - min);
    __m256i scaled_even = _mm256_shuffle_epi32(_mm256_mul_epu32(x, sizeOfRange), 0xB1);
    __m256i scaled_odd = _mm256_mul_epu32(_mm256_shuffle_epi32(x, 0xB1), sizeOfRange);
    __m256i scaled = _mm256_blend_epi32(scaled_even, scaled_odd, 0xAA);
    return _mm256_add_epi32(scaled, _mm256_set1_epi32(min));
}

It's still an exclusive range, it cannot handle the full [INT_MIN .. INT_MAX] as output range. There is no way to even specify it, the most it can do is [INT_MIN .. INT_MAX) (or for example an equivalent range with zero offset: [0 .. -1) ).

It's also not really uniform, for the same reason that the simple modulo-based range reduction isn't really uniform, you just cannot fairly divide N marbles over K bins unless K happens to divide N evenly.

harold
  • 61,398
  • 6
  • 86
  • 164
  • Thank you! To emphasize this once again to future users, **this produces numbers in the range [min, max) max is not included**. And, this code also works with `unsigned int` by default :) – Kari Jan 05 '21 at 07:59
  • do we need the (unsigned) cast in the sizeofRange line? I removed it and it works without, including negative integers – Kari Jan 05 '21 at 08:00
  • 1
    @Kari it's not critical, but it should be free anyway. I just put it there to prevent any signed overflow shenanigans – harold Jan 05 '21 at 10:43
2

The core idea is to use modulo instead of bitwise masks, which are useless in non-power-of-2 case. No branching is also a bit weird requirement. What you want is "fast enough", not "no branching and bitwise masks".

So assume that we have a function

int rand();

that produces a random integer uniformly. If max is of the form 2^n-1 then the following

rand() % (max+1)

will uniformly produce a random integer in the range [0,max]. That's because the total number of integers is a power of 2.

Now if min and max are such that max-min is of the form 2^n-1 then the following

(rand() % (max-min+1)) + min

will uniformly produce a random integer in the range [min, max].

But what happens when max-min is not of the form 2^n-1? Then we are out of luck. The (rand() % (max-min+1)) + min method will still produce a random integer in [min, max] range but no longer uniformly. Why is that? Because when n is fixed and not a power of 2, then the total number of integers that give concrete r = x % n result varies depending on r.

However the method is not bad. The bigger max-min value is the closer it gets to uniform distribution and often it is good enough in practice. And it is very fast, no branching.

Another example is

upper = get_upper_power_of_2(max - min)

do
{
    tmp = rand() % upper;
} while (tmp > max - min);

result = tmp + min;

This method has nice property that it is uniform, but it has no stop property, i.e. theoretically this algorithm may never stop. It also has branching. But in practice it does stop very fast (with a huge probability) and so it is quite common algorithm. For example it is in the standard Java library.

Both methods of course have an issue when max-min overflows (i.e. when min is a big negative number) which can be fixed if we switch to unsigned integers and then back to integers.

As far as I know there is no algorithm that generates a random integer in [0, max] when max is not of the form 2^n-1 from 01 uniform generator such that the results are uniform and it has stop property. I think that no such algorithm can exist, but I failed to find the appropriate result in computer science.

freakish
  • 54,167
  • 9
  • 132
  • 169
  • So you don’t count the `do`/`while` loop as branching? – Caleb Jan 04 '21 at 13:20
  • @Caleb of course I do. Perhaps I should've said that explicitely. Fixed. – freakish Jan 04 '21 at 13:24
  • My bad — I conflated *And it is very fast, no branching* with the code that immediately follows, but you made it clear that that’s a different solution. Sorry! I think your edit is an improvement though. – Caleb Jan 04 '21 at 13:35
  • I made OP relax the bitmasks idea. I am convinced that your answer is discussing powers of two for other (and appropriate) reasons. Hence the question did not turn moving target on your answer. Otherwise please accept my apology. – Yunnosch Jan 04 '21 at 13:46
  • This is vectorizable. A sketch for AVX2 here: https://godbolt.org/z/hrjrda. For AVX-512 it's possible to use `_mm_ternarylogic_epi32` for faster blending. Some other things that may be worth a try would to be to: 1. unfold the loop for the first N iterations, 2. Instead of replacing only the values that are not in range where we waste rng cycles it might be better to generate 8 values and store `k<8` in contiguous fashion, I think this can be achieved with either masked sparse stores or shuffle+store. Approach 2. would also make it "branchless" – Sopel Jan 04 '21 at 14:33
2

If you have 2^N random bits in a value, you can put it into an integer range by doing:

r = ((value * (max-min)) >> N) + min

Effectively, you are treating your value as a fraction with the multiply. You are guaranteed to get values in `[min...max)'

This ends up being two vectorizable operations: mulhi, 'add'

r = _mm256_add_epi16(
      _mm256_mulhi_epi16(value, _mm256_set1_epi16(max-min)), 
    _mm256_set1_epi16(min));

Although if you want 32-bits, it looks like you'll need two mul_epi32 and a shuffle to get your result.

For 64 bit values, see: Getting the high part of 64 bit integer multiplication (although that doesn't do vectorized forms)

Halt State
  • 421
  • 2
  • 6