0

I have an input uint64_t X and number of its N least significant bits that I want to write into the target Y, Z uint64_t values starting from bit index M in the Z. Unaffected parts of Y and Z should not be changed. How I can implement it efficiently in C++ for the latest intel CPUs?

It should be efficient for execution in loops. I guess that it requires to have no branching: the number of used instructions is expected to be constant and as small as possible.

M and N are not fixed at compile time. M can take any value from 0 to 63 (target offset in Z), N is in the range from 0 to 64 (number of bits to copy).

illustration:

illustration

Curious
  • 507
  • 3
  • 16
  • Bit shift and bit masks should do the job. – Evg May 18 '21 at 09:35
  • 1
    Have you tried anything yet? Are N and M fixed or dynamic? – rustyx May 18 '21 at 09:41
  • @rustyx They are dynamic. I tried to use bit shifts and masks as @Evg suggested but it seems not very efficient as sometimes the write into Y is not required and I have to put `if` statement that harms performance. – Curious May 18 '21 at 09:46
  • 3
    it is impossible to propose something "more efficient" unless you show the "not so efficient" version. As long as there is nothing, anything is more efficient ;) – 463035818_is_not_an_ai May 18 '21 at 09:48
  • @largest_prime_is_463035818 I would expect that "efficient" implementation contains a small constant number of bit manipulation instructions without branching. – Curious May 18 '21 at 09:50
  • 1
    its the crux with questions asking for "efficient" without explaining what they consider as "efficient". An efficient implementation can be one that is easy to write and read. Please clarify the question. – 463035818_is_not_an_ai May 18 '21 at 09:53
  • @largest_prime_is_463035818 I added clarification in the question. – Curious May 18 '21 at 09:57
  • 2
    actually your edit does not change how I see the issue: Without having a correct working solution it is futile to look for "the most efficient" or to guess what could cause a slow-down. Imho you should first profile your existing solution to see if and why it is slow. And perhaps include that implementation in the question. Though, maybe I am just being too pedantic, colloquially we actually do know what you mean with "efficient" – 463035818_is_not_an_ai May 18 '21 at 10:00
  • I've spent some time thinking about this, and I'm not at all optimistic that there is a nice solution. There are solutions of course, they're just not nice, at least not anything I could think of while thinking about both old instructions (`shld` and `shrd` in particular) and new ones (`bzhi`, `bextr`, what use is the rest?) – harold May 18 '21 at 10:00
  • You accepted Aki's answer. So the high bits of Y and Z should be `0`, not merging with some existing values? (Or the *low* bits of Z should be `0`? Where is the bottom of Z in the diagram?) If so, it's just a bitfield-extract of two variable-width fields. Please clarify in the question (@harold (aka nicomp for the past week or so) commented about this under the answer). If Aki's answer doesn't do what you're asking for, you should also un-accept it. – Peter Cordes May 18 '21 at 17:34
  • @PeterCordes I updated the question. I am not sure if I should un-accept the answer as it provides all the required building blocks (that are enough for me), but doesn't have copy-paste ready fragment of code. – Curious May 18 '21 at 17:56
  • The answer was updated to include masking ideas after I commented. Now it does cover the full question of merging into the existing Y and Z. – Peter Cordes May 18 '21 at 17:57

1 Answers1

2

There's at least a four instruction sequence available on reasonable modern IA processors.

X &= (1 << (N+1)) - 1;     // mask off the upper bits
//  bzhi    rax, rdi, rdx

Z = X << M;                
//  shlx    rax, rax, rsi

Y = X >> (64 - M);         
//  neg     sil
//  shrx    rax, rax, rsi

The value M=0 causes a bit of pain, as Y would need to be zero in that case and also the expression N >> (64-M) would need sanitation.

One possibility to overcome this is

x = bzhi(x, n);
y = rol(x,m);
y = bzhi(y, m);   // y &= ~(~0ull << m);
z = shlx(x, m);   // z = x << m;

As OP actually wants to update the bits, one obvious solution would be to replicate the logic for masks:

xm = bzhi(~0ull, n);
ym = rol(xm, m);
ym = bzhi(ym, m);
zm = shlx(xm, m);

However, clang seems to produce something like 24 instructions total with the masks applied:

Y = (Y & ~xm) | y; // |,+,^  all possible
Z = (Z & ~zm) | z;

It is likely then better to change the approach:

x2 = x << (64-N);  // align xm to left
y2 = y >> y_shift; // align y to right
y = shld(y2,x2, y_shift); // y fixed

Here y_shift = max(0, M+N-64)

Fixing Z is slightly more involved, as Z can be combined of three parts:

zzzzz.....zzzzXXXXXXXzzzzzz, where m=6, n=7

That should be doable with two double shifts as above.

Aki Suihkonen
  • 19,144
  • 1
  • 36
  • 57
  • Isn't it UB to shift 64-bit value 64 times when M = 0? – Curious May 18 '21 at 10:46
  • Sure it is. With compile time know constant inputs there has to be some sanitation. – Aki Suihkonen May 18 '21 at 10:50
  • I added clarification in the question that I am thinking about M and N, which became known only in runtime. – Curious May 18 '21 at 10:53
  • I added it to the question: M from 0 to 63, N from 0 to 64 – Curious May 18 '21 at 10:57
  • But this doesn't take the old Y and Z as inputs to update them, wasn't that part of the question @Curious ? That's how I understand "put into" anyway. And `N >> (64 - M)` should be `X >> (64 - M)` right? – harold May 18 '21 at 11:29
  • @nicomp Regarding the "put into" I think it doesn't really matter as it requires only to add additional `or` instruction, but yes I meant to update them, not overwrite. The second issue seems like a typo. – Curious May 18 '21 at 11:32
  • 1
    @Curious it is non-trivial to change the overwrite to an update though, not just an `or` instruction, but also masking to remove the old bits from the positions that need to be updated .. unless you know those bits are zero – harold May 18 '21 at 11:36
  • @nicomp You are right. I forgot about this case. – Curious May 18 '21 at 11:38
  • @Curious: If you want to wrap shift counts, see [Best practices for circular shift (rotate) operations in C++](https://stackoverflow.com/q/776508) - with `m&63` and `n&63`, the compiler will hopefully still use the same asm instructions without any extra AND, because it knows that the x86 shift instructions mask their counts that way. (Although BZHI doesn't (https://www.felixcloutier.com/x86/bzhi) - it effectively saturates the low 8 bits of the count, clearing the register if n>63) – Peter Cordes May 18 '21 at 16:37
  • @PeterCordes The bit count in BZHI is saturated, which sets the CF, but in that case no bits are cleared (if I read the specification correctly). – Aki Suihkonen May 18 '21 at 17:17
  • Oh right, zero bits *above* n, not below: brain fart. Yeah, higher n makes it clear fewer bits, so not clearing any bits is the sensible and actual behaviour. – Peter Cordes May 18 '21 at 17:19