
I am new to C++ and working with SIMD.

What I am trying to do is create a bitmask from the input in constant time.

For example:

input 1,3,4 => output 13 (or 26, the indexing scheme does not matter): 1101 (1st, 3rd and 4th bits are 1)
input 2,5,7 => output 82: 1010010 (2nd, 5th and 7th bits are 1)

The input type does not matter; it can be an array, for example.

I accomplished this with a for loop, which is not what I want. Is there a function that creates the bitmask in constant time?
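Roughly, my loop version looks like this (a simplified sketch, not my exact code):

#include <cstdint>
#include <vector>

// set one bit per input position (0-based positions assumed here)
std::uint64_t make_mask(const std::vector<unsigned>& positions)
{
    std::uint64_t mask = 0;
    for (unsigned p : positions)
        mask |= std::uint64_t(1) << p;
    return mask;
}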

Eldorado
  • Do you know in advance how many inputs there will be? What kind of SIMD library or intrinsics are you using? – Nate Eldredge Dec 22 '21 at 08:09
  • The problem is creating a mask for a 64-bit integer; the input count is not fixed. I am using if this is what you have asked. I will use _pext(number, mask) after creating this mask @NateEldredge – Eldorado Dec 22 '21 at 08:13
  • If the input count is not fixed, you'll at the very least have to iterate over the input, no? – You Dec 22 '21 at 08:15
  • That's what I think too. What if the input is a vector? Does it need to be a fixed size too, then? @You – Eldorado Dec 22 '21 at 08:32
  • CPU SIMD is based around short fixed-width vectors, 16, 32, or 64 bytes depending on the ISA and CPU capabilities. (A few CPUs support ISA extensions like ARM64 SVE that can let one binary take advantage of whatever vector width future CPUs might support, but AFAIK only ARM and RISC-V have that. There's no such x86-64 extension.) And within that, usually at most 64-bit elements, so you'd probably need some work to handle bit-positions larger than 63. – Peter Cordes Dec 22 '21 at 09:39
  • And of course horizontal OR reductions after bucket-sort(?) into buckets for the right part of the vector. But that'll take O(log n), which is dominated by the O(n/8) = O(n) cost of reading n input elements. Knocking down the constant factor by 8 (in practice probably less) is nothing to sneeze at, but it's not a better big-O complexity *class*. – Peter Cordes Dec 22 '21 at 09:44
  • Do you need "exact constant time" (for protection against timing attacks?) or just "as fast as possible"? If you never have more than 64 inputs, you are already in `O(64) = O(1)`. If you want to optimize your current implementation, it would be better to already start creating the mask at the point where your list is generated. – chtz Dec 22 '21 at 10:59

1 Answer


Constant time cannot work if you have a variable number of inputs. You have to iterate over the values at least once, right?

In any case, you can use intrinsics to minimize the number of operations. You have not specified your target architecture or the integer size, so I assume AVX2 as the target and 64-bit integers as output. Also, for convenience, I assume that the inputs are 64-bit integers, too.

If your inputs are smaller integer types than the output, you have to add some zero-extension, as in the second version further down.

#include <immintrin.h>

#include <array>
#include <cstdint>
#include <cstdio>


std::uint64_t accum(const std::uint64_t* bitpositions, std::size_t n)
{
  // 2 x 64 bit integers set to 1
  const __m128i ones2 = _mm_set1_epi64x(1);
  // 4 x 64 bit integers set to 1
  const __m256i ones4 = _mm256_broadcastsi128_si256(ones2);
  // 4 x 64 bit integers serving as partial bit masks
  __m256i accum4 = _mm256_setzero_si256();
  std::size_t i;
  for(i = 0; i + 4 <= n; i += 4) {
    // may be replaced with aligned load
    __m256i positions = _mm256_loadu_si256((const __m256i*)(bitpositions + i));
    // vectorized (1 << position) bit shift
    __m256i shifted = _mm256_sllv_epi64(ones4, positions);
    // add new bits to accumulator
    accum4 = _mm256_or_si256(accum4, shifted);
  }
  // reduce 4 to 2 64 bit integers
  __m128i accum2 = _mm256_castsi256_si128(accum4);
  __m128i high2 = _mm256_extracti128_si256(accum4, 1);
  if(i + 2 <= n) {
    // zero or one iteration with 2 64 bit integers
    __m128i positions = _mm_loadu_si128((const __m128i*)(bitpositions + i));
    __m128i shifted = _mm_sllv_epi64(ones2, positions);
    accum2 = _mm_or_si128(accum2, shifted);
    i += 2;
  }
  // high2 folded in with delay to account for extract latency
  accum2 = _mm_or_si128(accum2, high2);
  // reduce to 1 64 bit integer
  __m128i high1  = _mm_unpackhi_epi64(accum2, accum2);
  accum2 = _mm_or_si128(accum2, high1);
  std::uint64_t accum1 = static_cast<std::uint64_t>(_mm_cvtsi128_si64(accum2));
  if(i < n)
    accum1 |= std::uint64_t(1) << bitpositions[i];
  return accum1;
}
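
For reference, here is a minimal usage sketch (hypothetical values, not from the question; it assumes this is appended below the function above and compiled with AVX2 and BMI2 enabled, e.g. -mavx2 -mbmi2, since you want to feed the mask into _pext_u64):

int main()
{
  // the question's first example, written with 0-based bit positions
  const std::uint64_t positions[] = {0, 2, 3};
  const std::uint64_t mask = accum(positions, 3);
  std::printf("%llu\n", (unsigned long long)mask);   // prints 13

  // feeding the mask into _pext_u64 as mentioned in the comments (needs BMI2)
  const std::uint64_t packed = _pext_u64(0xff, mask);
  std::printf("%llu\n", (unsigned long long)packed); // prints 7
}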

EDIT

I have just seen that your example inputs use 1-based indexing. With the code above, an input value of 1 sets bit 1 rather than the lowest bit, and an input value of 0 is presumably undefined in that scheme. I suggest switching to zero-based indexing. But if you are stuck with that notation, just add a _mm256_sub_epi64(positions, ones4) or _mm_sub_epi64(positions, ones2) before the shift.
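
For example, the inner loop of the 64-bit version would then become something like this (a sketch; zero_based is just a name for illustration):

    __m256i positions = _mm256_loadu_si256((const __m256i*)(bitpositions + i));
    // convert the 1-based input positions to 0-based before shifting
    __m256i zero_based = _mm256_sub_epi64(positions, ones4);
    __m256i shifted = _mm256_sllv_epi64(ones4, zero_based);
    accum4 = _mm256_or_si256(accum4, shifted);

The scalar tail needs the same adjustment, i.e. shifting by bitpositions[i] - 1.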

With smaller input size

And here is a version for byte-sized input integers.

std::uint64_t accum(const std::uint8_t* bitpositions, std::size_t n)
{
  const __m128i ones2 = _mm_set1_epi64x(1);
  const __m256i ones4 = _mm256_broadcastsi128_si256(ones2);
  __m256i accum4 = _mm256_setzero_si256();
  std::size_t i;
  for(i = 0; i + 4 <= n; i += 4) {
    /*
     * As far as I can see, there is no point in loading a full 128 or 256 bit
     * vector. To zero-extend more values, we would need to use many shuffle
     * instructions and those have a lower throughput than repeated
     * 32 bit loads
     */
    // memcpy instead of a pointer cast avoids strict-aliasing and alignment
    // issues (requires #include <cstring>); it still compiles to one 32 bit load
    std::uint32_t chunk;
    std::memcpy(&chunk, bitpositions + i, sizeof(chunk));
    __m128i positions = _mm_cvtsi32_si128(static_cast<int>(chunk));
    __m256i extended = _mm256_cvtepu8_epi64(positions);
    __m256i shifted = _mm256_sllv_epi64(ones4, extended);
    accum4 = _mm256_or_si256(accum4, shifted);
  }
  __m128i accum2 = _mm256_castsi256_si128(accum4);
  __m128i high2 = _mm256_extracti128_si256(accum4, 1);
  accum2 = _mm_or_si128(accum2, high2);
  /*
   * Until AVX512, there is no single instruction to load just 2 bytes into a
   * vector register. So we don't bother. Instead, the scalar code below will
   * run up to 3 times.
   */
  __m128i high1  = _mm_unpackhi_epi64(accum2, accum2);
  accum2 = _mm_or_si128(accum2, high1);
  std::uint64_t accum1 = static_cast<std::uint64_t>(_mm_cvtsi128_si64(accum2));
  /*
   * We use a separate accumulator to avoid the long dependency chain through
   * the reduction above
   */
  std::uint64_t tail = 0;
  /*
   * Compilers create a ton of code if we give them a simple loop because they
   * think they can vectorize. So we unroll the loop, even if it is stupid
   */
  if(i + 2 <= n) {
    tail = std::uint64_t(1) << bitpositions[i++];
    tail |= std::uint64_t(1) << bitpositions[i++];
  }
  if(i < n)
    tail |= std::uint64_t(1) << bitpositions[i];
  return accum1 | tail;
}
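
If you want to double-check the tail handling, a trivial scalar reference is enough (a sketch; accum_ref is just my name for it, assuming 0-based positions below 64):

std::uint64_t accum_ref(const std::uint8_t* bitpositions, std::size_t n)
{
  std::uint64_t mask = 0;
  for(std::size_t i = 0; i < n; ++i)
    mask |= std::uint64_t(1) << bitpositions[i];
  return mask;
}

Comparing accum against accum_ref for lengths around the loop boundaries (n = 3, 4, 5, 6, 7) quickly catches off-by-one mistakes in the cleanup code.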
Homer512
  • You could get more ILP in the cleanup by leaving `accum2 = _mm_or_si128(accum2, high2);` until after the `if()`, so ORing in the final 128-bit vector can run in parallel with `vextracti128`. Otherwise yeah, looks sensible, although int64 is an unfortunately huge type for storing a bit-index. Would make more sense to use `uint8_t` or `int`, even though that means a SIMD version needs a `vpmovzxbq` load, which is a pain to write in a strict-aliasing-safe way with intrinsics. – Peter Cordes Dec 23 '21 at 05:58
  • @PeterCordes right. I extended the answer a bit. I feel like the reduction from 128 bit to 64 bit is suboptimal, too. Is _mm_extract_epi64 more appropriate than _mm_unpackhi_epi64? – Homer512 Dec 23 '21 at 09:51
  • `vpunpckhqdq` / op / `vmovq` is generally optimal for horizontal reductions with AVX, except in cases where you have one final operation left to do for the high half, which the low half doesn't need. (As in [this string->int](https://stackoverflow.com/questions/70420948/most-insanely-fastest-way-to-convert-9-char-digits-into-an-int-or-unsigned-int/70432059#comment124519909_70432059) with place-value multiply). `vpextrq` decodes as a shuffle + movq on most CPUs (except Alder Lake E-cores where it's 1 uop: https://uops.info/), so vmovq + vpextrq + OR is 4 total uops, and about the same latency. – Peter Cordes Dec 23 '21 at 10:21
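
For concreteness, the two reduction endings compared in these comments could be written roughly like this (a sketch; accum2 is assumed to hold the two partial 64-bit masks, as in the answer):

// vpunpckhqdq + vpor + vmovq: 3 uops on most CPUs
__m128i high = _mm_unpackhi_epi64(accum2, accum2);
std::uint64_t r1 = _mm_cvtsi128_si64(_mm_or_si128(accum2, high));

// vmovq + vpextrq + scalar OR: typically 4 uops, similar latency
std::uint64_t r2 = std::uint64_t(_mm_cvtsi128_si64(accum2))
                 | std::uint64_t(_mm_extract_epi64(accum2, 1));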