
As part of a compression algorithm, I am looking for the optimal way to achieve the following:

I have a simple bitmap in a uint8_t. For example 01010011

What I want is a __m256i of the form: (0, maxint, 0, maxint, 0, 0, maxint, maxint)

One way to achieve this is by shuffling a vector of 8 x maxint into a vector of zeros. But that first requires me to expand my uint8_t to the right shuffle bitmap.

I am wondering if there is a better way?
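To make the intended transform concrete, here is a plain scalar reference (just for illustration, not a proposed solution; the helper name is made up):

#include <stdint.h>

/* Scalar reference: bit i of the byte becomes 32-bit element i,
   all-ones if the bit is set, zero otherwise. */
static void expand_bitmap_scalar(uint8_t bits, uint32_t out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = ((bits >> i) & 1) ? 0xFFFFFFFFu : 0u;
}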

Thomas Kejser
  • Can't think of a nice solution. You could make a table with all the precomputed _m256i, indexed by the uint8_t. Since the blend instructions want an immediate, you could have a table of blends. AVX512 will help with this I think. – Marc Glisse Feb 23 '15 at 22:17
  • 1
    Alternatively you might try broadcasting the byte into each lane, masking out the single significant bit in each one, and finally comparing to create the mask. – doynax Feb 23 '15 at 22:24
  • 1
    @MarcGlisse lol, we're all waiting for AVX512. This is quite literally 2 instructions. `kmov + vmovdqa32` – Mysticial Feb 23 '15 at 23:29
  • 1
    @doynax, yeah, that's essentially the solution I came up with. – Z boson Feb 24 '15 at 09:07
  • Continuing the puzzle, here is my related problem: http://stackoverflow.com/questions/28735461/shift-elements-to-the-left-of-a-simd-register-based-on-boolean-mask – Thomas Kejser Feb 26 '15 at 05:57

3 Answers


I think I'd probably go for the "brute force and ignorance" approach initially, maybe something like this:

#include <immintrin.h>
#include <stdint.h>

uint8_t u = 0x53; // 01010011

/* 16-entry LUT: each nibble of u selects four 32-bit lanes of 0 or -1 (all ones) */
const union {
    uint32_t a[4];
    __m128i v;
} kLUT[16] = { { {  0,  0,  0,  0 } },
               { { -1,  0,  0,  0 } },
               { {  0, -1,  0,  0 } },
               { { -1, -1,  0,  0 } },
               { {  0,  0, -1,  0 } },
               { { -1,  0, -1,  0 } },
               { {  0, -1, -1,  0 } },
               { { -1, -1, -1,  0 } },
               { {  0,  0,  0, -1 } },
               { { -1,  0,  0, -1 } },
               { {  0, -1,  0, -1 } },
               { { -1, -1,  0, -1 } },
               { {  0,  0, -1, -1 } },
               { { -1,  0, -1, -1 } },
               { {  0, -1, -1, -1 } },
               { { -1, -1, -1, -1 } } };
__m256i v = _mm256_set_m128i(kLUT[u >> 4].v, kLUT[u & 15].v);

Using clang -O3 this compiles to:

movl    %ebx, %eax                ;; eax = ebx = u
andl    $15, %eax                 ;; get low offset = (u & 15) * 16
shlq    $4, %rax
leaq    _main.kLUT(%rip), %rcx    ;; rcx = kLUT
vmovaps (%rax,%rcx), %xmm0        ;; load low half of ymm0 from kLUT
andl    $240, %ebx                ;; get high offset = (u >> 4) * 16
vinsertf128 $1, (%rbx,%rcx), %ymm0, %ymm0
                                  ;; load high half of ymm0 from kLUT

FWIW I threw together a simple test harness for five implementations: (i) a simple scalar code reference implementation, (ii) the above code, (iii) an implementation based on @Zboson's answer, (iv) a slightly improved version of (iii), and (v) a further improvement on (iv) using a suggestion from @MarcGlisse. I got the following results with a 2.6 GHz Haswell CPU (compiled with clang -O3):

scalar code:                                 7.55336 ns / vector
Paul R:                                      1.36016 ns / vector
Z boson:                                     1.24863 ns / vector
Z boson (improved):                          1.07590 ns / vector
Z boson (improved + @MarcGlisse suggestion): 1.08195 ns / vector

So @Zboson's solution(s) win, by around 10% - 20%, presumably because they need only 1 load, versus 2 for mine.

If we get any other implementations I'll add these to the test harness and update the results.
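For reference, the per-vector figures come from expanding a long sequence of input bytes and dividing the elapsed time by the iteration count; a rough sketch of such a timing loop (not the actual harness code, and the function names are placeholders) is:

#include <stdint.h>
#include <time.h>

/* Rough timing sketch: call an expansion routine for many input bytes and
   return the average time per vector in nanoseconds. */
static double bench(void (*expand)(uint8_t, uint32_t out[8]), long iters)
{
    uint32_t out[8];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        expand((uint8_t)i, out);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}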


Slightly improved version of @Zboson's implementation:
__m256i v = _mm256_set1_epi8(u);
v = _mm256_and_si256(v, mask);
v = _mm256_xor_si256(v, mask);
return _mm256_cmpeq_epi32(v, _mm256_setzero_si256());


Further improved version of @Zboson's implementation incorporating suggestion from @MarcGlisse:
__m256i v = _mm256_set1_epi8(u);
v = _mm256_and_si256(v, mask);
return _mm256_cmpeq_epi32(v, mask);

(Note that mask needs to contain replicated 8-bit values in each 32-bit element, i.e. 0x01010101, 0x02020202, ..., 0x80808080.)
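Putting the pieces together, a minimal self-contained sketch of this final variant (the function name expand_mask8_epi32 is made up for illustration; the mask values follow the note above) might look like:

#include <immintrin.h>
#include <stdint.h>

/* Sketch: expand the 8-bit mask u into eight 32-bit lanes of all-ones/zero.
   Each mask element replicates one bit position across all four of its bytes,
   so (broadcast(u) & mask) == mask exactly when that bit of u is set. */
static inline __m256i expand_mask8_epi32(uint8_t u)
{
    const __m256i mask = _mm256_setr_epi32(0x01010101, 0x02020202, 0x04040404, 0x08080808,
                                           0x10101010, 0x20202020, 0x40404040, (int)0x80808080u);
    __m256i v = _mm256_set1_epi8((char)u);   // broadcast the byte to all 32 bytes
    v = _mm256_and_si256(v, mask);           // isolate one bit position per 32-bit element
    return _mm256_cmpeq_epi32(v, mask);      // equal to mask => bit was set => all-ones
}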


Paul R
  • Yes, you're probably right - the penalty for unaligned loads on Haswell/Broadwell is quite small, but it's still better to keep things aligned if possible. I just threw the above example together as a starting point rather than an actual solution, but I'll work on improving it. – Paul R Feb 23 '15 at 22:25
  • I think you're right about the -1 - I mentally translated `maxint` to `INT_MAX` but I see the OP also mentions SIMD booleans. I'll fix that. The loadu alignment penalty is 0 for aligned data as you say, and quite small for misaligned data, for some value of "quite small". – Paul R Feb 23 '15 at 22:35
  • 1
    I just checked and it does seem that to initialize my array, the best bet would be `const __m128i tab[]={_mm_set_epi32(0,0,0,0),...}` and hope that _mm_set_epi32 is evaluated at compile time so the array doesn't have to be dynamically initialized. So using a scalar array instead (what you are doing) makes sense. – Marc Glisse Feb 23 '15 at 22:38
  • Yes, I was working along similar lines just now and have updated the code in the answer - I've also tested it now. I'll take a look at the generated code next. – Paul R Feb 23 '15 at 22:42
  • Hmm, gcc refuses it :-( "initializer element is not constant" though g++ is ok with it. And it generates a horrible list of vmov* instead of a static array. I guess the original code was better :-( – Marc Glisse Feb 23 '15 at 22:44
  • Similar problem here with clang - I've now worked around the initialisation and alignment problem using a union (see above). – Paul R Feb 23 '15 at 22:57
  • C++11 wants -1U or UINT_MAX or something or it complains about narrowing. Looks good otherwise. – Marc Glisse Feb 23 '15 at 23:06
  • Thanks - I hadn't noticed the C++ tag initially so I coded this as C. It compiles without warnings with `clang -Wall` but may need further tweaks for C++ as you suggest. Also the ordering of the elements might be back-to-front, depending on how the mask is to be interpreted, but this too is easy to fix. – Paul R Feb 23 '15 at 23:15
  • Thanks for responding. The issue with this is that does a memory store/load. Even from L1 cache, this is already 4 cycles. – Thomas Kejser Feb 24 '15 at 10:35
  • @ThomasKejser: well it does 2 loads (no stores), and some of the load latency should get hidden by the adjacent ALU instructions (maybe all if you're lucky). I would suggest benchmarking this and any other solutions to see what works best for your particular use case (performance may well be data-dependent and access-pattern-dependent). – Paul R Feb 24 '15 at 10:41
  • 2
    Plus one for improving my answer, testing the performance, and showing me your clever solution using a LUT (even if you call it brute force it's still clever to me). – Z boson Feb 24 '15 at 13:14
  • @Zboson: heh - thanks - sometimes brute force is the way to go, or at least it can make a good foundation upon which you can build a better solution. – Paul R Feb 24 '15 at 13:31
  • 1
    yeah, I think your brute force technique can be useful in other cases. I updated my answer with your improvement (giving you credit of course). – Z boson Feb 24 '15 at 13:39
  • 1
    Even though it's not faster on my Haswell CPU here, I think the solution with MarcGlisse's further suggestion is probably the way to go - it uses fewer instructions and it may just be that my test harness is I/O bound, so it's potentially faster on other systems or in other contexts. – Paul R Feb 24 '15 at 13:46
  • Yeah, I agree, the solution by @MarcGlisse is probably the best one in general. – Z boson Feb 24 '15 at 15:19
  • Another possible optimisation is to hold the mask itself in a static const __m256i (or set it as const outside the test loop). Since this code will run in a very tight loop, that is realistically how it would be used anyway. – Thomas Kejser Feb 24 '15 at 16:45
  • @ThomasKejser: in my benchmark code the loading of the mask is outside the loop. I've tried doing this explicitly and also letting the compiler do the hoisting and I get the same result either way. – Paul R Feb 24 '15 at 17:25

Here is a solution (PaulR improved my solution, see the end of my answer or his answer) based on a variation of this question: fastest-way-to-broadcast-32-bits-in-32-bytes.

__m256i t1 = _mm256_set1_epi8(x);                            // broadcast the byte to all 32 bytes
__m256i t2 = _mm256_and_si256(t1, mask);                     // isolate one bit per 32-bit element
__m256i t4 = _mm256_cmpeq_epi32(t2, _mm256_setzero_si256()); // -1 where the bit was clear
t4 = _mm256_xor_si256(t4, _mm256_set1_epi32(-1));            // invert: -1 where the bit was set

I don't have AVX2 hardware to test this on right now, but here is an SSE2 version showing that it works, which also shows how to define the mask.

#include <x86intrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    char mask[32] = {
        0x01, 0x00, 0x00, 0x00,
        0x02, 0x00, 0x00, 0x00,
        0x04, 0x00, 0x00, 0x00,
        0x08, 0x00, 0x00, 0x00,
        0x10, 0x00, 0x00, 0x00,
        0x20, 0x00, 0x00, 0x00,
        0x40, 0x00, 0x00, 0x00,
        0x80, 0x00, 0x00, 0x00,
    };
    __m128i mask1 = _mm_loadu_si128((__m128i*)&mask[ 0]);
    __m128i mask2 = _mm_loadu_si128((__m128i*)&mask[16]);

    uint8_t x = 0x53; //0101 0011
    __m128i t1 = _mm_set1_epi8(x);
    __m128i t2 = _mm_and_si128(t1, mask1);
    __m128i t3 = _mm_and_si128(t1, mask2);
    __m128i t4 = _mm_cmpeq_epi32(t2,_mm_setzero_si128());
    __m128i t5 = _mm_cmpeq_epi32(t3,_mm_setzero_si128());
    t4 = _mm_xor_si128(t4, _mm_set1_epi32(-1));
    t5 = _mm_xor_si128(t5, _mm_set1_epi32(-1));

    int o1[4], o2[4];
    _mm_storeu_si128((__m128i*)o1, t4); // unaligned store: o1/o2 are not guaranteed 16-byte aligned
    _mm_storeu_si128((__m128i*)o2, t5);
    for(int i=0; i<4; i++) printf("%d \n", o1[i]);
    for(int i=0; i<4; i++) printf("%d \n", o2[i]);

}

Edit:

PaulR improved my solution

__m256i v = _mm256_set1_epi8(u);
v = _mm256_and_si256(v, mask);
v = _mm256_xor_si256(v, mask);
return _mm256_cmpeq_epi32(v, _mm256_setzero_si256());

with the mask defined as

int mask[8] = {
    0x01010101, 0x02020202, 0x04040404, 0x08080808,
    0x10101010, 0x20202020, 0x40404040, 0x80808080,
};

See his answer with performance testing for more details.
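The same trick should extend to four 64-bit elements driven by a 4-bit mask, using the 64-bit mask values suggested in a comment below; a hedged sketch (untested, made-up function name):

#include <immintrin.h>
#include <stdint.h>

/* Sketch: expand the low 4 bits of u into four 64-bit all-ones/zero lanes.
   The mask replicates each bit value across every byte of its 64-bit element. */
static inline __m256i expand_mask4_epi64(uint8_t u)
{
    const __m256i mask = _mm256_setr_epi64x(0x0101010101010101LL, 0x0202020202020202LL,
                                            0x0404040404040404LL, 0x0808080808080808LL);
    __m256i v = _mm256_set1_epi8((char)u);
    v = _mm256_and_si256(v, mask);
    return _mm256_cmpeq_epi64(v, mask);  // AVX2 64-bit compare
}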

Z boson
  • 1
    Here is the result: `-1 -1 0 0 -1 0 -1 0`, if one uses unsigned integers and reverses it, I guess it's the expected output. – luk32 Feb 24 '15 at 11:33
  • @Zboson: I've put together a test harness with your code and mine now - see my edited answer for timing data (TL;DR: you win!). – Paul R Feb 24 '15 at 11:38
  • @luk32, the order is correct. If you print a __m256i register (0,-1,0,-1,0,0,-1,-1) from the least significant element to the most significant you get -1 -1 0 0 -1 0 -1 0. – Z boson Feb 24 '15 at 11:46
  • @PaulR, thanks for testing this! I assume you defined `__m256i mask` outside of the main loop? – Z boson Feb 24 '15 at 11:51
  • @PaulR, it might (though I doubt it) make a small difference to define `_mm256_set1_epi32(-1)` also outside of the loop. However, `_mm256_setzero_si256()` will make no difference because it's totally free. – Z boson Feb 24 '15 at 11:58
  • @Zboson: yes, I'll double-check the generated code but I'm pretty sure the constants are only loaded once. I've also improved on your implementation slightly (see latest edit to my answer). – Paul R Feb 24 '15 at 12:02
  • 1
    Couldn't you test `t1&mask==mask` instead of `t1&mask!=0` to save a xor? – Marc Glisse Feb 24 '15 at 12:21
  • @MarcGlisse: yes you can, but then you need to use `_mm256_set1_epi32` instead of `_mm256_set1_epi8` and oddly the generated broadcast code then seems to be slower (at least with clang), such that any benefit gets wiped out. – Paul R Feb 24 '15 at 12:24
  • 1
    @MarcGlisse: I just realised if you change the mask then you can still do this - oddly though the shorter version without the XOR is no faster. – Paul R Feb 24 '15 at 12:30
  • 1
    @MarcGlisse, good suggestion `t1&mask==mask`. I should have thought of that. – Z boson Feb 24 '15 at 13:44
  • Strictly speaking, shouldn't the mask in the improved solution be unsigned? Clang warns me of overflow if not. – Thomas Kejser Feb 24 '15 at 14:44
  • 1
    @ThomasKejser, yes, it appears that way. I get a warning with `g++` unless I used unsigned (I should use `-Wall` more often). – Z boson Feb 24 '15 at 14:48
  • Probably worth adding to the answer: Here is the mask for 64 bit vectors: (0x0101010101010101, 0x0202020202020202, 0x0404040404040404, 0x0808080808080808) – Thomas Kejser Feb 24 '15 at 15:50

Based on all the answers, I hacked up a solution using Agner Fog's excellent vector class library (which handles AVX2, AVX and SSE with a common abstraction). Figured I would share it as an alternative answer:

#include "vectorclass.h"  // Agner Fog's vector class library (requires -std=c++11)
#include <cstdint>

// Used to generate 32 bit vector bitmasks from 8 bit ints
static const Vec8ui VecBitMask8(
      0x01010101
    , 0x02020202
    , 0x04040404
    , 0x08080808
    , 0x10101010
    , 0x20202020
    , 0x40404040
    , 0x80808080);

// As above, but for 64 bit vectors and 4 bit ints
static const Vec4uq VecBitMask4(
      0x0101010101010101
    , 0x0202020202020202
    , 0x0404040404040404
    , 0x0808080808080808);

template <typename V>
inline static Vec32c getBitmapMask();

template <> inline Vec32c getBitmapMask<Vec8ui>() {return VecBitMask8;};
template <> inline Vec32c getBitmapMask<Vec8i>() {return VecBitMask8;};
template <> inline Vec32c getBitmapMask<Vec4uq>() {return VecBitMask4;};
template <> inline Vec32c getBitmapMask<Vec4q>() {return VecBitMask4;};

// Returns a bool vector representing the bitmask passed.
template <typename V>
static inline V getBitmap(const uint8_t bitMask) {
    Vec32c mask = getBitmapMask<V>();
    Vec32c v1(bitMask);
    v1 = v1 & mask;
    return ((V)v1 == (V)mask);
}
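A possible usage sketch, assuming the definitions above are in scope and the code is built with something like -std=c++11 -mavx2 (the print loop is purely illustrative):

#include <cstdio>

int main() {
    // Expand 0x53 (0101 0011) into eight 32-bit boolean lanes.
    Vec8i v = getBitmap<Vec8i>(0x53);
    int out[8];
    v.store(out);   // Vec8i::store() writes eight 32-bit integers
    for (int i = 0; i < 8; i++) std::printf("%d ", out[i]);
    std::printf("\n");
    return 0;
}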
Thomas Kejser
  • Cool - I tried to incorporate this into the test harness but it throws a lot of compile errors with clang++ - do I need to do anything other than `#include ` to make this work ? – Paul R Feb 24 '15 at 22:21
  • vectorclass.h should do it. You need to compile with C++11 though. – Thomas Kejser Feb 24 '15 at 22:23
  • Hmm - still getting a lot of errors even with `-std=c++11` - the first one is: `vectorf128.h:215:22: error: ambiguous conversion for functional-style cast from 'const Vec4fb' to 'Vec4ib'` - I'll try a different compiler when I get a chance (probably tomorrow). – Paul R Feb 24 '15 at 22:27
  • I'm a big fan of Agner's VCL. Your code is not optimal for AVX512 though. The VCL does not have `Vec64c` however. I assume this is because AVX512 only supports 32-bit and 64-bit integers. But in your case you only need to broadcast bytes. After that you act on 32-bit integers. You should be able to adjust your code so that it works for AVX512 as well. – Z boson Feb 25 '15 at 11:52
  • @Zboson: Correct, I am currently only compiling for 256-bit registers. I will eventually need to adjust for 512. Also, I may need a 16 bit int variant at some point.... There are a lot of things that would be great to add to Agner Fog's library. I hope to contribute a few things once I have my code running. – Thomas Kejser Feb 25 '15 at 14:28
  • @PaulR: Could be that you are compiling for a different SIMD instruction set. Might want to try `-mavx`. However, if it does not compile on a lower set, either my code is buggy or we found an issue with Agner's library. – Thomas Kejser Feb 25 '15 at 14:32
  • I'm compiling with `-mavx2` and I also noticed that my copy of vectorclass was not current so I moved from 1.14 to 1.16, but I still have a few errors remaining. When I get some more time I'll try and get this resolved as I've never played with this library before so it might be instructive to get to grips with it. – Paul R Feb 25 '15 at 14:40
  • @PaulR, try some hello world examples in the intro to the manual. The manual also suggests compile options but I only saw a special one (`-fabi-version=0`) for GCC (but on page 90 it also says this for Clang). It also lists some compile error examples. – Z boson Feb 25 '15 at 14:53
  • Thanks - it shouldn't take too long to get it working, but I'm a bit overloaded at work today - I'll come back to this when things quieten down again. – Paul R Feb 25 '15 at 15:24
  • Based on the comment by @Mysticial, there is probably a different and more efficient way to do this with AVX512 anyway. – Z boson Feb 27 '15 at 07:34