1

I have following function which generates the hash given a string.

uint64_t
djb2(uint8_t *ptr, uint64_t len) {
        uint64_t hash = 5381;
        uint64_t step = 0;
    
        do { // I can gurantee that [len > 0].
                hash = ((hash << 5) + hash) + ptr[step];
                ++step;
        } while (step < len);
    
        return hash;
} 

this generates following asm:

djb2:
        xor     edx, edx
        mov     eax, 5381
.L2:
        mov     rcx, rax
        sal     rcx, 5
        add     rax, rcx
        movzx   ecx, BYTE PTR [rdi+rdx]
        add     rdx, 1                   ; this is interesting.
        add     rax, rcx
        cmp     rdx, rsi
        jb      .L2
        ret

As you can see GCC with O2 prefers ADD 1 instead of INC.

Question

  • Is there a reason for that? If I'm not wrong there's no performance penalty but the code size of INC is smaller.
  • If this is really a missed thing by GCC (which I highly doubt, probably I'm wrong here), can I somehow fix it?

Note

  • You can play with this example in Godbolt.
  • I also thought that it picked ADD because it can execute multiple ADDs in parallel, but it is hard to say (I'm not an expert here).

EDIT

  • In the comments there are extremely valuable links, however I find it hard to understand the reasoning of picking ADD over INC in this example, based on the given answers (in those links).
  • @4386427 The answers say that `INC` is more preferred version. Am I wrong? If no, why does `GCC` put an `ADD` here? –  Apr 29 '21 at 12:18
  • Also read this https://stackoverflow.com/a/13383496/4386427 – Support Ukraine Apr 29 '21 at 12:19
  • @4386427 can I ask in which circumstances is `ADD` significantly faster than `INC`? There are links in the answer, but they are too long. I just want to understand the reasoning behind it (Also if I'm not wrong he talks about `x86` but there they say that `INC` is better on `x64`). –  Apr 29 '21 at 12:22
  • 2
    Hmm... "... they are too long." Processor architecture is a big and complicated topic so there's a lot of stuff to read. Anyway - if you can explain why the linked post doesn't answer your question, you can edit your question to get it reopened – Support Ukraine Apr 29 '21 at 12:30
  • @4386427 A) "why doesn't answer your question" already explained in my previous comment: I just don't understand why does it pick `ADD` when they(In your link) "clearly" say that `INC` is better. And the first link also doesn't tell me directly what's the reasoning behind my exact example here. B) "Processor architecture is a big and complicated topic" not going to argue against that, however do you think that I need all those +1000 pages for this question? Is there maybe a section in those manuals which would lighten up my question? –  Apr 29 '21 at 12:37
  • The linked posts doesn't clearly state that inc is better. As described in the various answers, it depends on a number of things. Anyway, if you edit your question and explain why the linked posts isn't answering your Q, I'll go for a reopen. – Support Ukraine Apr 29 '21 at 12:45
  • 2
    As per [INC instruction vs ADD 1: Does it matter?](https://stackoverflow.com/q/36510095) - there are a few microarchitectures where `inc reg` is somewhat slower, and `gcc -mtune=generic` (the default) is presumably trying to avoid problems on *any* uarches. At the cost of 1 byte of code (or 2 bytes in 32-bit mode), GCC devs decided it was worth it. GCC usually takes a conservative approach that might be a bit slower on most architectures to avoid slowdowns on a few, while clang tends to be more reckless. – Peter Cordes Apr 29 '21 at 12:58
  • 1
    It's a tradeoff. (And yes, if you want to really understand the tradeoff GCC is making in more detail than my last comment, you do need to read my linked answer. You can skim over the parts about SHL and flags, that's not directly relevant to INC vs. ADD. Understanding the Silvermont extra uop vs. everything-else point is important; that's the still-relevant tradeoff vs. code-size.) – Peter Cordes Apr 29 '21 at 12:59
  • @PeterCordes can I somehow "fix" it? Can I tell the `GCC` to not do that? –  Apr 29 '21 at 12:59
  • 3
    Yes, use `-mtune=intel`. Or if you have a CPU other than silvermont-family or Pentium 4, use `-mtune=native`. That should make it use `inc reg`, and `add [mem], 1`, which are the optimal choices on Sandybridge-family. (AMD might want to use `inc [mem]` as well, e.g. `-march=znver2`) – Peter Cordes Apr 29 '21 at 13:01
  • @PeterCordes Oh `-mtune=intel` didn't work but `-mtune=native` did. Thanks Peter. I'll find time and read your answer properly. –  Apr 29 '21 at 13:03
  • 1
    Oh right, I guess the CPUs that dislike `inc` *are* all Intel (P4 which `-mtune=generic` doesn't care about anymore, and Silvermont-family which it does). I was probably mixing up my memory of some other tuning choice where `-mtune=intel` does the good thing instead of the thing that caters to old obsolete CPUs, perhaps `rep ret` (AMD K10) – Peter Cordes Apr 29 '21 at 13:09

0 Answers0