Questions tagged [bmi]

Bit Manipulation Instructions Sets (BMI sets) are extensions to the x86 instruction set architecture for microprocessors from Intel and AMD. The purpose of these instruction sets is to improve the speed of bit manipulation. (Courtesy of Wikipedia.)

Bit Manipulation Instructions Sets (BMI sets) are extensions to the x86 instruction set architecture for microprocessors from Intel and AMD. The purpose of these instruction sets is to improve the speed of bit manipulation. All the instructions in these sets are non-SIMD and operate only on general-purpose registers.

See: Wikipedia entry for BMI.

34 questions
15
votes
2 answers

Standard C++11 code equivalent to the PEXT Haswell instruction (and likely to be optimized by compiler)

The Haswell architectures comes up with several new instructions. One of them is PEXT (parallel bits extract) whose functionality is explained by this image (source here): It takes a value r2 and a mask r3 and puts the extracted bits of r2 into…
Vincent
  • 57,703
  • 61
  • 205
  • 388
13
votes
2 answers

Confusion about bsr and lzcnt

I'm a bit confused about both instructions. First let's discard the special case when the scanned value is 0 and the undefined/bsr or bitsize/lzcnt result - this difference is clear and not part of my question. Let's take the binary value 0001 1111…
hopperpl
  • 298
  • 3
  • 8
10
votes
0 answers

How to efficiently find the n-th set bit?

For code related to this question, I need to compute the following as fast as possible: Given a 32 bit integer i, compute the position of the n-th least significant set bit. Both n and the result should be 0-indexed. For example, given the number…
fuz
  • 88,405
  • 25
  • 200
  • 352
9
votes
2 answers

Portable efficient alternative to PDEP without using BMI2?

The documentation for the parallel deposit instruction (PDEP) in Intel's Bit Manipulation Instruction Set 2 (BMI2) describes the following serial implementation for the instruction (C-like pseudocode): U64 _pdep_u64(U64 val, U64 mask) { U64 res =…
markt1964
  • 2,638
  • 2
  • 22
  • 54
8
votes
2 answers

How to use x86intrin.h

In one of my applications, I need to efficiently de-interleave bits in a long stream of data. Ideally, I would like to use the BMI2 pext_u32() and/or pext_u64() x86_64 intrinsic instructions when available. I scoured the internet for doc on…
Michael Back
  • 1,821
  • 1
  • 16
  • 17
6
votes
2 answers

BMI for generating masks with AVX512

I was inspired by this link https://www.sigarch.org/simd-instructions-considered-harmful/ to look into how AVX512 performs. My idea was that the clean up loop after the loop could be removed using the AVX512 mask operations. Here is the code I am…
Z boson
  • 32,619
  • 11
  • 123
  • 226
5
votes
1 answer

Select spans of set bits in a bitmask that overlap with a 1-bit in a selector bitmap

Given: A bitmask a (say, std::uint64_t), which contains at least one set (1) bit. A selector bitmask b which is a subset of a (i.e. a & b == b), and has at least one bit set. I want to select spans of contiguous 1-bits in a which overlap with a…
Tomilov Anatoliy
  • 15,657
  • 10
  • 64
  • 169
5
votes
2 answers

Compiler macro to detect BMI2 instruction set

I was searching on the web to find a proper solution, without much success. So I hope one of you know something about it: Is there any way to detect the "Intel Bit Manipulation Instruction Sets 2" (BMI2) compile time? I want to make some conditional…
plasmacel
  • 8,183
  • 7
  • 53
  • 101
4
votes
0 answers

How to get Rust compiler to emit BZHI instruction without resorting to platform-specific code?

The Rust compiler & LLVM are sometimes so smart. I used x = x & (x - 1) to clear the lowest significant set bit. It recognized this expression and translated it to the blsr intrinsic and gave me a big speedup. And without me having to use any…
user1002430
4
votes
2 answers

Multishift operation

How to implement without loop an operation on bitmasks, which for two bitmasks a and b of width n gives bitmask c of width 2 * n with following properties: i-th bit in c is set only if there is j-th bit in a and k-th bit in b and j + k == i C++…
Tomilov Anatoliy
  • 15,657
  • 10
  • 64
  • 169
3
votes
0 answers

Fast SIMD bit deposition in a regular pattern

Edit: The last method described below is by far the fastest, giving a 5+ times speedup on smaller datasets; the problem was the benchmark itself, which caused page faults and had too much data to fit in cache. My problem is the following: I have a…
Ovinus Real
  • 528
  • 3
  • 10
3
votes
0 answers

How to extract bitfield of loop-invariant size with varying start?

I need to do some bit manipulations: get k bits from i-th in a 64-bit register, where 9 ≤ k ≤ 12. k may change depending on a value we've read, i is counted from MSB and increments by k after each read. (i and k are only needed to describe the task,…
Yegorka
  • 31
  • 4
3
votes
3 answers

What are assembly instructions like PEXT actually used for?

I watched a youtube video on the Top 10 Craziest Assembly Language Instructions and some of these instructions have no obvious application to me. What's the point of something like PEXT, which takes only the bits from the second argument which match…
wesmlr
  • 53
  • 8
3
votes
1 answer

Does AVX support imply BMI1 support?

I have some code that relies on AVX. In the same code base I also use TZCNT. The latter is part of BMI1. I know I can test for this instruction using CPUID, but I'm lazy so I did not actually implement that. To test for support I simply perform an…
Johan
  • 74,508
  • 24
  • 191
  • 319
3
votes
1 answer

intrinsic for the mulx instruction

The mulx instruction was introduced with the BMI2 instruction set starting with the Haswell processor. According to Intel's documentation there should be an intrinsic for mulx unsigned __int64 umul128(unsigned __int64 a, unsigned __int64 b, unsigned…
Z boson
  • 32,619
  • 11
  • 123
  • 226
1
2 3