3

I need to do some bit manipulations: get k bits from i-th in a 64-bit register, where 9 ≤ k ≤ 12. k may change depending on a value we've read, i is counted from MSB and increments by k after each read. (i and k are only needed to describe the task, neither of them has to be in the final solution)

The simplest solution I came up with looks like that in C:

uint32_t result = ((n << i) >> (64 - k));

I wonder if there is a better way (especially in terms of performance) to do the same thing. E.g. I found BEXTR assembly instruction that would do exactly what I need.

Would it work faster? Or maybe there is a better way to do all of that?

I'm not sure if benching it on my single processor would be sufficient, thus the question is more about how much the instruction is optimized in microcode

Upd. 1 Nate Eldredge gave a wonderful reference: uops.info

Upd. 2 Jester and Peter Cordes gave a nice idea of using shrx + bzhi

Yegorka
  • 31
  • 4
  • This sounds like something you can benchmark and measure against your other solutions. – paddy Nov 25 '21 at 00:09
  • Are `i` and `k` known in advance? (not necessarily constant, but known well ahead of `n`) A quirk of BEXTR is that it needs the start/length encoded in a slightly inconvenient format which would itself take a couple of instructions to prepare – harold Nov 25 '21 at 00:15
  • 4
    The canonical reference is https://uops.info/ – Nate Eldredge Nov 25 '21 at 00:25
  • 1
    Are you counting bits from the MSB? That happens to be inconvenient. – Jester Nov 25 '21 at 00:29
  • 3
    `BEXTR` looks like 1 uop on AMD CPUs, and 2 uops on Intel. So on AMD it's better than shift/mask, and on Intel they might be about the same. – Nate Eldredge Nov 25 '21 at 00:33
  • Thanks, @NateEldredge for the reference! I came looking for copper and found I gold – Yegorka Nov 25 '21 at 00:34
  • 2
    At least clang will recognize the pattern `(n >> shift) & mask` and use `bextr` if appropriate. Since you count from the MSB, this is going to get ugly. For most current cpus clang chooses `shrx` and `bzhi`. – Jester Nov 25 '21 at 00:49
  • Note that any reasonable compiler will convert your C code into a BEXTR if you enable optimization and compile for a target that supports it (which the default target might not) – Chris Dodd Nov 25 '21 at 01:27
  • Do you have proof of that? None of clang, gcc, icc seem to generate `bextr`. The most success I had was with clang and a different pattern (see above). – Jester Nov 25 '21 at 01:34
  • 1
    You might be better off with `shrx` / `bzhi` given a start and length for your bitfield, especially Intel CPUs. Or yeah, a pair of shifts is fine on Intel if your control integers are naturally in separate registers, as long as you can use BMI2 `shrx`. (Variable-count `shl reg, cl` is 3-uop on Intel because of conditionally updating FLAGS in case count = 0, x86 baggage.) *How* does your loop find out the next `i` value? – Peter Cordes Nov 25 '21 at 04:18
  • Your current question title (about performance of BEXTR) is not all that interesting, and easily answered by https://uops.info/ and/or https://agner.org/optimize/. But it seems what you really want to ask is about bitfield extraction with loop-invariant length but varying start. (Or varying end if your numbers are in terms of distance from MSB). Given the updates, you should probably edit the question (especially the title) to ask about that specific case. It's only a couple instructions so the details and surrounding code really do matter a lot. – Peter Cordes Nov 25 '21 at 08:06
  • 3
    BTW an alternative approach to a "decode loop" (which it sounds like you have) is to *move the whole window* after you extract the op k bits from it. That way you don't have to solve this more general problem of "extract k bits from position i", it's always the top k bits. – harold Nov 25 '21 at 14:01
  • Do you really have to do this on a register once or do you actually have a bunch of `uint64_t` in memory that you fetch into a register and process one-by-one? In the latter case, you could vectorize the whole thing to have even better performance. – Marco Bonelli Nov 25 '21 at 19:56
  • I just have binary data that I need to interpret as a contiguous bunch of 9 to 12 bit numbers. Bitness of the latter number may change depending on the value of the current one, so I can't vectorize my operations easily, but I'm thinking of a way to do it – Yegorka Nov 25 '21 at 20:15
  • 1
    Can you show this in the context of the loop it will be in? – Erik Eidt Nov 25 '21 at 21:11
  • 1
    Yeah probably hard to beat scalar if you have a list of bit-lengths to extract, even if you had AVX-512VBMI for [`vpmultishiftqb`](https://www.felixcloutier.com/x86/vpmultishiftqb) which is a parallel bitfield-extract that grabs 8x 8-bit chunks from arbitrary positions in the corresponding source qword. So it wouldn't work well for variable length. It's useful for stuff like int->hex, e.g. see the version using it in [How to convert a binary integer number to a hex string?](https://stackoverflow.com/q/53823756) (@MarcoBonelli) – Peter Cordes Nov 26 '21 at 02:33

0 Answers0