3

Making my homework for implementing Conway's Game of Life using intrinsic functions found the working code, but cannot understand the main part of it.

This implementation first calculates amount of alive neighbors for each sell and store the result in an array counts, so the array of sells (world) is states. I cannot really get how the newstate is generated here. I understand how left shift works, how bitwise OR works, but I cannot understand why they're used like this, why shufmask is like this and how the shuffle works. Also cannot understand why _mm256_slli_epi16 used if the type of array elements is uint8_t. So my question is all about this string

__m256i newstate = _mm256_shuffle_epi8(shufmask, _mm256_or_si256(c, _mm256_slli_epi16(oldstate, 3)));

Could you please explain for me, dummy boy, if it's possible maximum detailed, how it works.

void gameoflife8vec(uint8_t *counts, uint8_t *states, size_t width, size_t height) {
assert(width % (sizeof(__m256i)) == 0);
size_t awidth = width + 2;
computecounts8vec(counts, states, width, height);
__m256i shufmask =
    _mm256_set_epi8(
        0, 0, 0, 0, 0, 1, 1, 0,
        0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 1, 1, 0,
        0, 0, 0, 0, 0, 1, 0, 0
    );
for (size_t i = 0; i < height; i++) {
    for (size_t j = 0; j < width; j += sizeof(__m256i)) {
        __m256i c = _mm256_lddqu_si256(
            (const __m256i *)(counts + (i + 1) * awidth + j + 1));
        c = _mm256_subs_epu8(
            c, _mm256_set1_epi8(
                1)); // max was 8 = 0b1000, make it 7, 1 becomes 0, 0 remains 0

        __m256i oldstate = _mm256_lddqu_si256(
            (const __m256i *)(states + (i + 1) * awidth + j + 1));
        __m256i newstate = _mm256_shuffle_epi8(
            shufmask, _mm256_or_si256(c, _mm256_slli_epi16(oldstate, 3)));
        _mm256_storeu_si256((__m256i *)(states + (i + 1) * awidth + (j + 1)),
            newstate);
    }
}
}

The memory for array is allocated in this way

uint8_t *states = (uint8_t *)malloc((N + 2) * (N + 2) * sizeof(uint8_t));
uint8_t *counts = (uint8_t *)malloc((N + 2) * (N + 2) * sizeof(uint8_t));

Also the source code can be found here https://github.com/lemire/SIMDgameoflife

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
lolmosk
  • 55
  • 7
  • See [here](https://stackoverflow.com/a/46602049) for a clear explanation of `_mm256_shuffle_epi8`. – wim Feb 06 '19 at 10:58
  • @wim the algorithm of shuffle is pretty much clear I meant that I cannot get how it conduct the new state of cell. So it should somehow check if how much alive neighbors near the cell and get the answer – lolmosk Feb 06 '19 at 11:01
  • I see. Apparently, I misunderstood the title of your question. Unfortunately I'm not familiar with Conway's Game. – wim Feb 06 '19 at 11:10
  • 1
    @wim edited it thank you – lolmosk Feb 06 '19 at 11:15
  • @wim: you've probably heard of https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life, but the OP forgot to capitalize "Life". If you haven't, check it out; cellular automata are nifty. It's even Turing-complete. – Peter Cordes Feb 06 '19 at 11:18
  • @PeterCordes: I've heard of it, but I've never thought about a computer implementation. Nice topic anyway. Cool application of `_mm256_shuffle_epi8` as a 16-entry LUT! – wim Feb 06 '19 at 12:24

1 Answers1

5

shuffle_epi8 is being used here as a parallel table-lookup, with a constant first operand and a variable 2nd operand.

Daniel's code does some calculations that produce a 4-bit integer for every byte in the vector, then uses _mm256_shuffle_epi8 to map those integers to 0 / 1 alive-or-dead new states.

Notice that the low and high lanes of shufmask are identical: it's the same lookup table for both lanes. (It's not a lane-crossing shuffle, it's 32 parallel lookups from 2x 16-byte tables, using the low 4 bits in each element. And the high bit to zero it out.) See the intrinsic and asm instruction documentation.


shufmask is a poor choice of variable name. It's not the shuffle-control vector. alivetable might be a better choice.


Using [v]pshufb to implement a 16-entry LUT is a (fairly) well-known technique. For example, it's one way to implement a popcnt for large arrays that's faster than scalar, splitting bytes into low/high nibble and looking up the 4-bit popcnt results. See Counting 1 bits (population count) on large data using AVX-512 or AVX-2, specifically https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-avx2-lookup.cpp

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • It's much much better in my mind now, but still cannot get thing about so named `shuffmask` or `alivetable`. Why is it exactly like this, how it map the 4-bit integers to 0 / 1. Sorry its too complicated for me. – lolmosk Feb 06 '19 at 12:07
  • Peter, please may be you could somehow illustrate 1 iteration of generating `newstate` or just explain it step by step, would be so nice from you – lolmosk Feb 06 '19 at 12:09
  • @lolmosk: [unexpected \_mm256\_shuffle\_epi with \_\_256i vectors](//stackoverflow.com/a/46602049) shows how `vpshufb` does 16 / 32 parallel select operations. The result is `result[i] = src1 [ src2[i] ]` (within each 16-byte lane). – Peter Cordes Feb 06 '19 at 12:32