
Basically I am trying to understand the code at: https://gcc.godbolt.org/z/7xxb3G

#include <stdint.h>

void __attribute__((noinline))
cond_unset_bit(uint64_t *v, uint32_t b) {
    /* If bit b is set (expected to be the likely case), flip it off. */
    if (__builtin_expect(!!(*v & ((1UL) << b)), 1)) {
        *v ^= ((1UL) << b);
    }
}

compiles to:

cond_unset_bit(unsigned long*, unsigned int):
        movq    (%rdi), %rax
        btq     %rsi, %rax
        jnc     .L6
        btcq    %rsi, %rax
        movq    %rax, (%rdi)
.L6:
        ret

Based on Agner Fog's instruction tables (Skylake is on p. 238), `btq` and `btcq` have exactly the same cost when operating on a register. `btcq` also sets the carry flag to the previous value of the bit, so it seems the exact same logic could be accomplished (with better performance) without the `btq` instruction, i.e.:

cond_unset_bit(unsigned long*, unsigned int):
        movq    (%rdi), %rax
        btcq    %rsi, %rax
        jnc     .L6
        movq    %rax, (%rdi)
.L6:
        ret

What is the reason for including the `btq`?

I am tuning for x86_64 on an Intel Skylake chip.

Edit: Thanks @Peter Cordes (and for the help on all my other posts :)

  • Yup, that's a missed optimization. Report it on https://gcc.gnu.org/bugzilla/ with the "missed-optimization" keyword. Expect to find lots of minor missed optimizations if you look carefully at compiler output. GCC7.5 doesn't use `btc` at all, instead using mov / SHLX to materialize `(1UL) << b` and XOR to apply it (a sketch of that sequence follows this thread). – Peter Cordes Aug 16 '20 at 01:05
  • I think there might be merit to the `shlx` + `xor` approach if you have a prefetch on the memory and want to delay touching the memory until the last moment (assuming it's not already hot). But as a general case that seems strange (and as far as I know there is no way to tell the compiler how hot a given memory region will be). (Obviously that's outside the scope of the question, but I'm curious if you have any thoughts on that, @Peter Cordes.) – Noah Aug 16 '20 at 01:13
  • Yes, I guess that could reduce the critical-path latency to detect a branch mispredict by 1 cycle (from load result ready to starting branch recovery), because you get to use `test` / `jz` (which can macro-fuse into a single uop) instead of `bt` + `jnc`, which can't. But that's very minor, and `btc reg,reg` has 1 cycle latency on Skylake, same as `xor` or `bt`. So other than branch misprediction, it's pointless to prepare a mask in parallel with the `bt` when you could simply have the bit-flip done for free at the same cost (on Intel) as `bt`, like GCC7 is doing. – Peter Cordes Aug 16 '20 at 01:21
  • Very much doesn't seem worth it though, to gain a small speedup (relative to total cost) for branch misses, when it costs multiple extra uops in the fast-path case. GCC might not have different tuning heuristics specialized for AMD vs. Intel in this case, though, and `btc r,r` on AMD is 2 uops vs. 1 for `bt`. Still doesn't seem worth it unless you expect the flip + store to be very rarely done, with most executions taking the branch and returning. – Peter Cordes Aug 16 '20 at 01:24
  • I see. Thank you for the extra answer. (I guess it would be ridiculous to include that then, given the `__builtin_expect(1)`.) If it were the reverse, maybe it would be worth it. – Noah Aug 16 '20 at 01:27
  • Why not simply `btr`? – 0___________ Aug 16 '20 at 01:46
  • `btc` and `btr` perform the same task here; it's just a matter of style (I am explicitly only trying to unset bits with this code, which fits `btc` better if you ask me). – Noah Aug 16 '20 at 02:17
  • @PeterCordes An unrelated question: I am writing a semi-complex function in inline assembly that is going to be on the HOT HOT HOT path in my code. Would it be appropriate to post it on Stack Overflow asking about missed optimizations? (I only ask because you seem to be running the show in the micro-optimization category.) – Noah Aug 16 '20 at 02:20
  • Unsetting bits is clearly what `btr` is for; "reset" is right in the name. `btc` is for flipping them. You could use `btr` without checking the original state; that was @P__J__'s point. If you expect the bit to usually be set, and thus need unsetting, an unconditional `btr` makes a lot of sense instead of jumping over a `mov` (sketched below, after the comments). – Peter Cordes Aug 16 '20 at 02:44
  • (Especially if you can arrange to use just dword or byte size (movzx load, byte store) without too much hassle from strict-aliasing, getting the correct shift count within a narrower chunk, or causing store-forwarding stalls for some later reload; see the byte-sized sketch below. Then 2 bits in separate halves of a qword could be unset in parallel, instead of one having to wait for a store/reload after the first.) – Peter Cordes Aug 16 '20 at 02:44
  • re: code review: consider posting on https://codereview.stackexchange.com/, although posting here in the micro-optimization tag might work too. Especially if you phrase it as a description of the high-level problem you're trying to solve, with the inline asm as your current best attempt. That invites other totally different options, like SIMD if that's possible. I'd be more than happy to see a question like that on SO, even if it was a bit code-review-ish. If the main point is how that asm or equivalent executes quickly on a CPU, it feels more SO than Code Review. – Peter Cordes Aug 16 '20 at 02:47
  • @P__J__ Good point (I got `btc` and `btr` flipped :/). – Noah Aug 16 '20 at 02:58
  • @PeterCordes I think I'll post here first because I care about perf > all else and I think optimization is probably the right tag for that. Thank you! – Noah Aug 16 '20 at 02:59
  • @PeterCordes If you're interested, my larger post: https://stackoverflow.com/questions/63433790/optimizing-percpu-2-level-bit-vector-in-restartable-sequence-on-x86-64-skylake – Noah Aug 16 '20 at 06:25
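
For reference, here is a sketch of the `shlx` + `xor` sequence Peter describes GCC7.5 emitting. This is a reconstruction under the assumption of BMI2 and this particular register allocation, not GCC's literal output; note that `test` + `je` can macro-fuse where `bt` + `jnc` can't:

cond_unset_bit(unsigned long*, unsigned int):
        movq    (%rdi), %rax        # load *v
        movl    $1, %edx
        shlx    %rsi, %rdx, %rdx    # rdx = 1UL << b (BMI2)
        testq   %rdx, %rax          # ZF=1 if bit b is clear; test/je macro-fuse
        je      .L6
        xorq    %rdx, %rax          # clear the bit
        movq    %rax, (%rdi)
.L6:
        ret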
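
And a sketch of the unconditional `btr` version Peter suggests, assuming you don't mind the store happening even when the bit was already clear; there is no branch at all:

cond_unset_bit(unsigned long*, unsigned int):
        movq    (%rdi), %rax
        btrq    %rsi, %rax          # clear bit b; CF = previous bit value (unused)
        movq    %rax, (%rdi)        # unconditional store
        ret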
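
Finally, a hypothetical byte-sized variant of the kind Peter's last comment hints at: operate on only the byte containing bit b (valid because x86 is little-endian, so bit b of the qword is bit b%8 of byte b/8), letting stores to different bytes of the qword proceed in parallel. The exact instruction selection here is my guess:

cond_unset_bit(unsigned long*, unsigned int):
        movl    %esi, %ecx
        shrl    $3, %ecx            # byte index = b / 8
        movzbl  (%rdi,%rcx), %eax   # load only the containing byte
        andl    $7, %esi            # bit position within that byte
        btrl    %esi, %eax          # clear it; CF = old value
        movb    %al, (%rdi,%rcx)    # byte store
        ret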

0 Answers