3

I am representing a bitfield with __m128i and need a fast way to check whether or not a specific bit is set, and also a way to set a specific bit. Do I have to set up another __m128i as a mask and OR them, or is there an instruction I am missing that is faster? I am using the Intel compilers.

  • http://stackoverflow.com/questions/47981/how-do-you-set-clear-and-toggle-a-single-bit-in-c-c?rq=1 – Tim Feb 19 '14 at 21:44
  • @TimCastelijns `__m128` is not a standard C++ type. Anyway, is the solution you proposed too slow? – Bartek Banachewicz Feb 19 '14 at 21:45
  • @BartekBanachewicz it's not too slow, but for my own knowledge I would like to know if there is a faster method –  Feb 19 '14 at 21:46
  • The doc on __m128i states that you are *never* supposed to access this type directly. I didn't investigate further, but I assume this means that MS have provided suitable functions to test or access this data. Sorry, not much use I know... – Tim Bergel Feb 19 '14 at 21:50
  • @user1201584, if you have at least SSE4.1 you can test a bit with two intrinsics: `_mm_testz_si128` and `_mm_and_si128`. – Z boson Feb 20 '14 at 10:03
  • If the bit position you want to test is a compile-time constant, set with `_mm_or_si128`, and test with `_mm_testz_si_128`. Otherwise, you need to generate masks. You could maybe figure out which qword (or dword for 32bit) the bit you want is in, and extract that to a gp reg for use with BMI1/BMI2 instructions, or generate masks by using vector shifts that take the shift count in a register, instead of imm8. (The biggest bitwise shift available is 64bit, though, not 128.) If you're searching for the first set or unset bit, there are probably faster ways than just iterating over bits. – Peter Cordes Jun 19 '15 at 06:54

3 Answers3

4

You could try something like this. I don't believe there is a quicker way. You will likely want to pull some of the constant values and the table out of the performance crittle part of the code.

  __m128i v; // todo: set v to something here

  // to check
  int n; // todo: set n to the zero-indexed bit to check

  __m128i chkmask  = _mm_slli_epi16(_mm_set1_epi16(1), n & 0xF);
  int     movemask = (1 << (n >> 3));
  int     isSet  = (_mm_movemask_epi8(_mm_cmpeq_epi8(_mm_and_si128(chkmask, v), _mm_setzero_si128())) & movemask) ^ movemask;

  // to set
  int m; // todo: set m to the zero-indexed bit to set

  __m128i shuf    = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
          shuf    = _mm_add_epi8(shuf, _mm_set1_epi8(16 - (m >> 3)));
          shuf    = _mm_and_si128(shuf, _mm_set1_epi8(0x0F));
  __m128i setmask = _mm_shuffle_epi8(_mm_cvtsi32_si128(1 << (m & 0x7)), shuf);
  v = _mm_or_si128(v, setmask);

  // or to try the look-up table approach to check and set
  __declspec(align(16)) __m128i lut[] = {
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00000001),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00000002),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00000004),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00000008),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00000010),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00000020),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00000040),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00000080),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00000100),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00000200),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00000400),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00000800),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00001000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00002000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00004000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00008000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00010000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00020000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00040000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00080000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00100000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00200000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00400000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x00800000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x01000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x02000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x04000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x08000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x10000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x20000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x40000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0x80000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000001, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000002, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000004, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000008, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000010, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000020, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000040, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000080, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000100, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000200, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000400, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00000800, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00001000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00002000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00004000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00008000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00010000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00020000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00040000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00080000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00100000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00200000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00400000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x00800000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x01000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x02000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x04000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x08000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x10000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x20000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x40000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000000, 0x80000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000001, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000002, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000004, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000008, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000010, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000020, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000040, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000080, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000100, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000200, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000400, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00000800, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00001000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00002000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00004000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00008000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00010000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00020000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00040000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00080000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00100000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00200000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00400000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x00800000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x01000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x02000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x04000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x08000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x10000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x20000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x40000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000000, 0x80000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000001, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000002, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000004, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000008, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000010, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000020, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000040, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000080, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000100, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000200, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000400, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00000800, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00001000, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00002000, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00004000, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00008000, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00010000, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00020000, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00040000, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00080000, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00100000, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00200000, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00400000, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x00800000, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x01000000, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x02000000, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x04000000, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x08000000, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x10000000, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x20000000, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x40000000, 0x00000000, 0x00000000, 0x00000000),
    _mm_set_epi32(0x80000000, 0x00000000, 0x00000000, 0x00000000)
  };

   // to check with look-up table
   movemask = (1 << (n >> 3));
   isSet    = (_mm_movemask_epi8(_mm_cmpeq_epi8(_mm_and_si128(v, _mm_load_si128(lut + m)), _mm_setzero_si128())) & movemask) ^ movemask;

   // to set with look-up table
   v = _mm_or_si128(v, _mm_load_si128(lut + m));
Apriori
  • 2,308
  • 15
  • 16
  • This is the best answer, since I can't find anything better either. This is too many instructions, so I'm going to look back at just using two 64-bit ints as the bit field –  Feb 20 '14 at 00:39
  • @user1201584 You might also try having a look-up table of 128 16-byte values. But of course you have to touch memory for that (in cache). It's hard to say how fast that would be without a test. If performance matters, you will probably want to to test different solutions. I've seen bloated SSE code like this outperform simpler 64-bit versions, and in cases you wouldn't think the added logical complexity was worth the extra register width. – Apriori Feb 20 '14 at 00:47
  • Was going to profile this, but _mm_slli_si128(a,b) --> b has to be an immediate value. Any work-around? –  Feb 20 '14 at 01:58
  • @user1201584 Ah, sorry forgot about that. This is what I get for not having VS on my internet machine. Are you wanting to stick to SSE2? Or is SSSE3 and SSE4 ok? – Apriori Feb 20 '14 at 02:12
  • Everything <=SSE4.0 is ok –  Feb 20 '14 at 02:17
  • @user1201584 I posted an update. It's pretty easy to do the check since you can easily do some extra logic once the movemask has been pulled out. I posted a couple versions to set the bit. The shuffle should do a circular shift, the instruction only looks at the low nibbles. I'm not sure if any of this will be more efficient. Sorry for not building the first solution before posting it, tbh I haven't been able to test the update either but I'm needing to step away from my machine for a bit. Hopefully I'll have time later. – Apriori Feb 20 '14 at 03:04
  • Also, I'm not sure that things like _mm_set_epi64 resolve to an instruction. But I'd need to look at the assembly to be sure. – Apriori Feb 20 '14 at 03:05
  • I appreciate your time, however none of these work (improper bits set, check doesn't appear to work right either). If you have time later I will check back on the question, otherwise don't worry about it :) Cheers –  Feb 20 '14 at 04:17
  • @user1201584 It was a slippery slope ever since I submitted that first answer without building. Sorry about that, I really do know better than to assume the code works the first time. VS on a machine with a connection to the internet is a wondrous thing. Anyway, I've fixed the issues and confirmed that it works by debugging though it so feel free to give it a try again. I included a lookup table solution, I think that might be your best bet for speed, at least if this code stays SSE. But it’s always hard to say without perf testing. – Apriori Feb 20 '14 at 08:17
  • @Apriori, there is a faster way to check at bit with SSE4.1 using `_mm_testz_si128`. See my answer. – Z boson Feb 20 '14 at 09:54
  • @Z boson, using _mm_testz_si128 is definitely an improvement over my answer, I didn't know that instruction/intrinsic was there; thank you. I think the check solutions are pretty closely matched, but adding the sub 1 really cleans up the scalar portion of the code; nicely done. However one thing your solution is missing is the ability to create the mask with which to check and set. Unless I'm mistaken I believe the asker wanted to be able check/set based on a bit index. Luckily it's easy to drop the lookup table above into your solution to accomplish this. – Apriori Feb 20 '14 at 16:09
  • @user1201584, I believe Z boson's check/set code is superior to what I posted, I won't be offended if you unaccept my answer and accept his. At this time it is however missing the ability to create the mask with which to check; so I think portions of both answers should be consolidated somewhere for the sake of a complete answer. That is unless someone comes up with a better way to create the 16-byte mask. – Apriori Feb 20 '14 at 16:12
  • Unfortunately I can't use SSE4.1, but Z boson's answer is good to know. Both of the examples you've created are faster than what I was doing. I don't have the proper framework to do a real benchmark, but I appreciate your time in fixing this. I'll keep it checked since it is what I am actually going to use in my project. Cheers! –  Feb 20 '14 at 16:16
  • @user1201584, No problem, glad to help. You can get an accurate idea of performance just looping over something a bunch of times and calling a low precision timer outside of the loop. I took the time to do something similar on a recent question here if you are looking for a starting point. http://stackoverflow.com/questions/21770799/convert-0x1234-to-0x11223344/21786118#21786118 – Apriori Feb 20 '14 at 16:28
1

For what it's worth here is a variation I came up with for testing a bit. If the mask and one resister can be precomputed then this only needs three intrinsic.

For setting single bits I don't think there is an efficient way. Here is a discussion on going from movemask back to an SSE register How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?

#include <emmintrin.h>
#include <stdio.h>
int main() {
    __m128i x = _mm_setr_epi32(0,0,0,1);
    __m128i mask = _mm_setr_epi32(0,0,0,1);
    __m128i one = _mm_set1_epi8(1);
    int isSet = 0xffff != _mm_movemask_epi8(_mm_sub_epi8(_mm_and_si128(x,mask),one));
    printf("%X\n", isSet);  
}

Edit actually there is a faster way to check a bit with SSE4.1 using _mm_testz_si128.

#include <smmintrin.h>
#include <stdio.h>

int main() {
    __m128i x = _mm_setr_epi32(0,0,0,1);
    __m128i mask = _mm_setr_epi32(0,0,0,1);

    __m128i t = _mm_and_si128(x,mask);
    int isSet = !_mm_testz_si128(t,t);

    printf("%d\n", isSet);  
}
Community
  • 1
  • 1
Z boson
  • 32,619
  • 11
  • 123
  • 226
0

There are no instructions for setting individual bits in __m128i.

You can try using the general-purpose BTS instruction, but it will probably be slower than making a mask, because it can only write to memory (or to 32-bit registers, which doesn't help).

anatolyg
  • 26,506
  • 9
  • 60
  • 134