
Here is an example piece of code:

#include <stdint.h> 
#include <iostream>

typedef struct {
    uint16_t low;
    uint16_t high;
} __attribute__((packed)) A;

typedef uint32_t B;

int main() {
    //simply to make the answer unknowable at compile time
    uint16_t input;
    std::cin >> input;
    A a = {15,input};
    B b = 0x000f0000 + input;
    //a equals b
    int resultA = a.low-a.high;
    int resultB = b&0xffff - (b>>16)&0xffff;
    //use the variables so the optimiser doesn't get rid of everything
    return resultA+resultB;
}

Both `resultA` and `resultB` calculate the exact same thing - but which is faster (assuming you don't know the values at compile time)?

I tried using Compiler Explorer to look at the output. With any optimisation enabled, no matter what I tried, the compiler outsmarted me and optimised the whole calculation away (at first it removed everything, since the results weren't used). I added `cin` to make the input unknowable at compile time, but then I couldn't figure out how the compiler was producing the answer at all (I think it still managed to work it out at compile time?).

Here is the output of Compiler Explorer with no optimisation flag:

        push    rbp
        mov     rbp, rsp
        sub     rsp, 32
        mov     dword ptr [rbp - 4], 0
        movabs  rdi, offset std::cin
        lea     rsi, [rbp - 6]
        call    std::basic_istream<char, std::char_traits<char> >::operator>>(unsigned short&)
        mov     word ptr [rbp - 16], 15
        mov     ax, word ptr [rbp - 6]
        mov     word ptr [rbp - 14], ax
        movzx   eax, word ptr [rbp - 6]
        add     eax, 983040
        mov     dword ptr [rbp - 20], eax
        ; Begin calculating result A
        movzx   eax, word ptr [rbp - 16]
        movzx   ecx, word ptr [rbp - 14]
        sub     eax, ecx
        mov     dword ptr [rbp - 24], eax
        ; End of calculation
        ; Begin calculating result B
        mov     eax, dword ptr [rbp - 20]
        mov     edx, dword ptr [rbp - 20]
        shr     edx, 16
        mov     ecx, 65535
        sub     ecx, edx
        and     eax, ecx
        and     eax, 65535
        mov     dword ptr [rbp - 28], eax
        ; End of calculation
        mov     eax, dword ptr [rbp - 24]
        add     eax, dword ptr [rbp - 28]
        add     rsp, 32
        pop     rbp
        ret

I will also post the -O1 output, but I can't make much sense of it (I'm quite new to low-level assembly).

main:                                   # @main
        push    rax
        lea     rsi, [rsp + 6]
        mov     edi, offset std::cin
        call    std::basic_istream<char, std::char_traits<char> >::operator>>(unsigned short&)
        movzx   ecx, word ptr [rsp + 6]
        mov     eax, ecx
        and     eax, -16
        sub     eax, ecx
        add     eax, 15
        pop     rcx
        ret

Something to consider: while doing operations with the integer is slightly harder, simply accessing it as an integer is easier compared to the struct (which you'd have to convert with bit shifts, I think?). Does this make a difference?

This originally came up in the context of memory, where I saw someone map a memory address to a struct with a field for the low bits and the high bits. I thought this couldn't possibly be faster than simply using an integer of the right size and bitshifting if you need the low or high bits. In this specific situation - which is faster?

[Why did I add C to the tag list? While the example code I used is in C++, the concept of struct vs variable is very applicable to C too]

Fareanor
ichigo
  • x86 supports 16 bit loads, see the `movzx eax, word ptr [rbp - 16]`. That is going to be the best. If the compiler recognizes the second version and optimizes it to the same accesses then both will be equally fast of course. – Jester Jan 12 '23 at 13:48
  • If you want to look at asm for a runtime variable, write a function that takes an arg and returns a value. No need to bring `cin.operator>>` into it. [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116) – Peter Cordes Jan 12 '23 at 13:52
  • If you don't enable optimization, there's no point discussing anything. ([How to optimize these loops (with compiler optimization disabled)?](https://stackoverflow.com/q/32000917) / [Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?](https://stackoverflow.com/q/53366394).) With optimization, GCC is usually pretty good about seeing unpacking halves of a single integer, although it might sometimes use scalar 32-bit shifts instead of separate loads. – Peter Cordes Jan 12 '23 at 13:55
  • https://godbolt.org/z/EvrGzPnfE has two separate functions. Actually 3, one that does the same thing with `b` as with `a`, compiling to the same asm, and one with your code to show why your asm is weird: `warning: suggest parentheses around '-' in operand of '&' [-Wparentheses]`. If it had to load them from memory via a reference or pointer, then there's a difference in code-gen: https://godbolt.org/z/3efjYxa18 – Peter Cordes Jan 12 '23 at 13:59
  • C `struct` and C++ `struct` are not the same thing, hence why it's better to target a specific language, because the answer may be completely different. It would be better to ask two separate questions for each language than mixing the two in one. – Fareanor Jan 12 '23 at 13:59
  • @Fareanor: There's no difference in how they optimize, especially when the struct doesn't have any member functions. I don't object to removing the [c] tag, but it wasn't really necessary. (The `typedef struct` instead of just `struct` is written to be compatible with C, but then the code uses `cin>>` for no reason, instead of a function arg or `volatile` to defeat optimization. So yeah, clearly this is C++ code in this instance, but your argument that they're not the same thing doesn't really hold water, and it would definitely not be better to ask 2 separate questions.) – Peter Cordes Jan 12 '23 at 15:12
  • @PeterCordes This is what I get for not putting -Wall in the flags straight away! I didn't even think about operator precedence when writing that, I just assumed. In my head it seemed 'right', but I guess that's because I don't often see the bitwise AND operator about, so I don't have an intuitive sense of its place. – ichigo Jan 12 '23 at 23:47

3 Answers

3

Other than the fact that some ABIs require that structs be passed differently than integers, there won't be a difference.

Now, there are important semantic differences between two 16-bit ints and one 32-bit int. If you add to the lower 16-bit int, it will not "overflow" into the higher one, while if you add to the lower 16 bits of a 32-bit int, the carry will propagate into the upper bits. This difference in possible behavior (even if you, yourself, "know" it could not happen in your code) could change what assembly code is generated by your compiler, and impact performance.

Which of the two would be faster is not knowable without actual testing, or a full description of the exact problem; so it is a toss-up there.

Which means the only real concern is the ABI one. Without whole-program optimization, a function taking a struct and a function taking an int with the same binary layout will make different assumptions about where the data is.

This only matters for single by-value arguments, however.

The 90/10 rule applies; 90% of your code runs for less than 10% of the time. The odds are this will have no impact on your critical path.

Yakk - Adam Nevraumont
  • In practice there is a code-gen difference when operating on a struct in memory (https://godbolt.org/z/3efjYxa18); if the packed value is already in a register GCC makes the same asm for both functions (if implemented correctly; the OP already has a bug in their `&` vs. `-` operator precedence, so that's another reason to avoid it.) – Peter Cordes Jan 12 '23 at 15:10
  • @PeterCordes I think `bar` accidentally does different signed math there (in the shift right?). Even if I'm wrong, it is really easy for the two operations to not be identical. – Yakk - Adam Nevraumont Jan 12 '23 at 16:14
  • `typedef uint32_t B` means `(b>>16)` is an unsigned shift. GCC would use `shr` even if you removed the `&0xffff`. (Although it could potentially optimize signed right shift plus `&0xffff` into `shr` instead of `sar`+`and`.) – Peter Cordes Jan 12 '23 at 21:52
  • Oh, but the `uint16_t` version does signed `int` subtraction after integer promotion. The shift/AND version does `uint32_t` subtraction (when compiling for x86 and other typical ABIs where `uint32_t` is at least as wide as `int`), and then conversion to `int` happens later. On systems with 32-bit or wider `int`, both ways are guaranteed to give the same values in the end, though, since the subtraction can't overflow. And with 16-bit `int`, promotion wouldn't happen so you'd get wrapping 16-bit unsigned subtract, then conversion to int, same(?) as modulo-reduction of `uint32_t` subtraction. – Peter Cordes Jan 12 '23 at 21:58
  • But with different math, the different type could matter. That's more a matter of C unsigned to signed `int` promotion being tricky with the original `struct` version, though; the version where everything is `uint32_t` is probably closer to intended. – Peter Cordes Jan 12 '23 at 22:00
2

When trying to answer questions of performance, examining unoptimized code is largely irrelevant.

As a matter of fact, even examining the results of -O1 optimization is not particularly useful, because it does not give you the best that the compiler can achieve. You should try at least -O2.

Regardless of the above, the sample code you provided is unsuitable for examination, because you should be making sure that the values of `a` and `b` are separately unknowable by the compiler. As the code stands, the compiler does not know what the value of `input` is, but it does know that `a` and `b` will have the same value, so it optimizes the code in ways that make it impossible to derive any useful conclusions from it.

As a general rule, compilers tend to do an exceptionally good job when dealing with structs that fit within machine words, to the point where, generally, there is absolutely no performance difference between the two scenarios you are considering, or in any of the special cases you are pondering.

Mike Nakis
2

Using GCC on Compiler Explorer, the version with the struct produces fewer instructions at -O3.

Code:

#include <stdint.h> 

typedef struct {
    uint16_t low;
    uint16_t high;
} __attribute__((packed)) A;

typedef uint32_t B;

int f1(A a)
{
    return a.low - a.high;
}

int f2(B b)
{
    return b&0xffff - (b>>16)&0xffff;
}

Assembly:

_Z2f11A:
    movzwl  %di, %eax
    shrl    $16, %edi
    subl    %edi, %eax
    ret
_Z2f2j:
    movl    %edi, %edx
    movl    $65535, %eax
    shrl    $16, %edx
    subl    %edx, %eax
    andl    %edi, %eax
    ret

But this might be because the two functions don't do the same thing, as `-` has a higher precedence than `&`. When the `B` case is fixed to do the same thing as `A`, the exact same assembly is produced.

Code:

int f3(B b)
{
    return (b&0xffff) - ((b>>16)&0xffff);
}

Assembly:

_Z2f3j:
    movzwl  %di, %eax
    shrl    $16, %edi
    subl    %edi, %eax
    ret

Note that the only way to find out if something is faster is to benchmark it in a real world use case.

Mestkon
  • You forgot to mention that `f2` doesn't do the same thing as `f1`, because of missing `()` around the `&` operators. Note in the asm the `sub` before `and`. I mentioned this in my [earlier comment](https://stackoverflow.com/questions/75097238/which-is-faster-a-struct-or-a-primitive-variable-containing-the-same-bytes/75098021#comment132523842_75097238) on the question. That's why the asm is different. With the bug spotted by `gcc -Wall` fixed, the asm is the same. – Peter Cordes Jan 12 '23 at 15:14
  • @PeterCordes Thanks! I didn't even notice that. – Mestkon Jan 12 '23 at 15:16
  • Always a good idea to use `-Wall` on Godbolt, to catch surprises like this. Especially other people's code, and/or when the asm is different but you think it could/should be the same. – Peter Cordes Jan 12 '23 at 15:24