0

Problem

Does an instruction exist that gathers/extracts the first bit of an int[32] and stores it into a int?

  • I know about the intrinsic pext but that is not really what I want.

  • I do have a code for that but I thought maybe there a designated instruction.

  • The ints array is zero besides the first bit. Ergo, no masking needed.

Code

void ints2bits(int &bits, int *ints) {
   bits = (ints[0] << 0) + (ints[1] << 1) + ... + (ints[31] << 31);
}

UPDATE & FEEDBACK:

Just tested harold suggestions. It works very well and I can attain nice speed up.

Armen Avetisyan
  • 1,140
  • 10
  • 29
  • If you want to extract first bit of all 31 integers of array and a store the bits at appropriate place in an int, then you code is wrong. – sameerkn Aug 09 '16 at 05:48
  • It is not wrong. I literally use it. The question is only if there is a specific instruction for this operation. Masking of the array is not needed by the way, which is maybe the reason you claim its faultiness. – Armen Avetisyan Aug 09 '16 at 07:39
  • Your code assumes that all the other bits in the ints are zero, you should probably state that explicitly in your problem specification. – samgak Aug 10 '16 at 02:52
  • Why did you edit your question without accepting harold's answer? It's still the correct answer: vector left shift -> `movmskps` does exactly what you're asking for. – Peter Cordes Aug 11 '16 at 07:21

1 Answers1

2

There is no single instruction that can even read that much data, but groups of 4 (8 with AVX2) can be processed quickly with the use of _mm_movemask_ps. Ignore the fact that it claims to be a floating point instruction, it just gathers and appends the 4 top bits.

Of course moving the bottom bit into the top is easy with _mm_slli_epi32.

So putting it together (not tested)

int res = 0;
for (int i = 0; i < 32; i += 4) {
    __m128i x = _mm_load_si128((__m128i*)&ints[i]); // I assume it's aligned
    x = _mm_slli_epi32(x, 31);
    int bits = _mm_movemask_ps(_mm_castsi128_ps(x));
    res += bits << i;
}

The extension to AVX2 is pretty obvious.

An other possible approach is shifting each lane by a variable amount (pre-AVX2 this requires multiplication) and then summing, first vertically of course, saving the horizontal sum for last. This is likely slower, and certainly more awkward.

harold
  • 61,398
  • 6
  • 86
  • 164