Why is struct{u64} faster than struct{u32,u32}?

Question

Huh, ideone shows it slower. In g++ 4.9.3 in my VM (ubuntu werewolf) I get A: 830 B: 460. In either case why is one faster than the other? I assumed A is being push by reference and B is by value but I don't see why it would do that. I tried using operator<(A a, ... and that didn't help.

struct A {
    u32 first, second;
    A(int f, int s):first(f),second(s) {}
    bool operator<(u32 b) const { return first < b; }
};
//bool operator<(const A&a, u32 b) { return a.first < b; }

struct B {
    u64 v;
    B(int f, int s) { v = ((long)f << 32) | s; }
    bool operator<(u32 b) const {
        return (v >> 32) < b;
    }
};
u32 get_value(deque<A>& d, u32 v) {
    auto p = lower_bound(d.begin(), d.end(), v);
    if (p != d.end())
        return p->second;
    else
        return UINT_MAX;
}
u32 get_value(deque<B>& d, u32 v) {
    auto p = lower_bound(d.begin(), d.end(), v);
    if (p != d.end())
        return p->v & 0xFFFFFFFF;
    else
        return UINT_MAX;
}

int main(int argc, char *argv[]) {

    {
        deque<A> d;
        struct timeval s, e;
        gettimeofday(&s, 0);
        for (int i = 0; i < 1024LL * 1024 * 1024 * 3 / 32; ++i)
            d.emplace_back(A(i, i ^ 92142));
        long v = 0;
        for (int i = 0; i < 10000; ++i)
            v += get_value(d, i * 3 + 1);

        gettimeofday(&e, 0);
        auto sec = e.tv_sec - s.tv_sec;
        auto usc = e.tv_usec - s.tv_usec;
        printf("A %ld\n", v);
        printf("Time: %lu\n", sec * 1000 + usc / 1000);
    }
    {
        deque<B> d;
        struct timeval s, e;
        gettimeofday(&s, 0);
        for (int i = 0; i < 1024LL * 1024 * 1024 * 3 / 32; ++i)
            d.emplace_back(B(i, i ^ 92142));
        long v = 0;
        for (int i = 0; i < 10000; ++i)
            v += get_value(d, i * 3 + 1);

        gettimeofday(&e, 0);
        auto sec = e.tv_sec - s.tv_sec;
        auto usc = e.tv_usec - s.tv_usec;
        printf("A %ld\n", v);
        printf("Time: %lu\n", sec * 1000 + usc / 1000);
    }
}

Peter Cordes · Accepted Answer · 2015-08-23T05:38:22.483

I think you're getting a speedup with struct B { u64 v; } because you're using it with compile-time-constant args. The compiler can combine both values into a 64bit store.

I made a simple non-member function to get asm output produced for the non-inline case of operater< for A and B. As godbolt shows, the asm for struct A::operator< is significantly more efficient. B::operator< really does do a 64bit load, then shift. I commented the asm for the benefit of people that don't know x86 very well.

bool comp_A(const struct A &lhs, u32 rhs) { return lhs < rhs; }
bool comp_B(const struct B &lhs, u32 rhs) { return lhs < rhs; }

comp_A(A const&, unsigned int):
    cmpl    %esi, (%rdi)  # compare with lhs->first as a memory operand
    setb    %al
    ret
comp_B(B const&, unsigned int):
    movq    (%rdi), %rax  # load lhs->v
    movl    %esi, %esi    # zero-extend the low 32b of rhs
    shrq    $32, %rax     # shift the high32 of v down to the low 32
    cmpq    %rsi, %rax
    setb    %al           # al=1 if rax "bigger than" rsi, else 0.
    ret

x86 has fast hardware support for loading any size integer from RAM into a 32 or 64bit temporary (in a register). movzx and movsx zero or sign extend, and are as cheap as regular loads with mov. 32bit ops always zero the high 32 of the dest register, so false dependencies on old values aren't a problem. (They are for 16 and 8bit ops, which is why movzx to a 32bit temporary is a good plan.)

I haven't looked at the asm of your actual functions, as it's quite large. I'd suggest using version A in general, though. It may be that gcc isn't copying it around with 64bit moves, maybe because it might not be aligned? x86 doesn't require alignment (except for non-AVX (legacy-SSE) 16byte memory operands). However, IIRC it's only since about 2008 or 2009 CPUs (Intel's Nehalem) that unaligned loads/stores had no penalty as long as they didn't cross a cache line. gcc may still be reluctant to use them, since the potential gain is smaller than the potential downside on old CPUs with slow unaligned access.

You might get gcc to give you the best of both worlds with a union.

union A { u64 v; u32 first_last[2]; };

This might induce the compiler to use 64bit moves when copying it around, but still do 32bit loads of A.first_last[0] and not need to shift when you're accessing individual fields.

Hmm. Just changing A constructor to use v made it as fast as B. There was still 20ms difference and swapping the test around (b first than A) produced the same results but flipped. The second case is typically 20ms faster. The shift VS using a 32bit variable seemed not to affect anything. This answer is accepted. — , Aug 23 '15 at 06:42
Thats a nice site I can never remember how to produce asm code from gcc (especially optimized asm). But now that I look I remember I can never wade through it. I don't see 'emplace_back' but I see push_back. I don't know where the constructor is. I don't see gettimeofday. I can't find anything actually there's too much assembly. I can read a portion of it but it doesn't help me find the lines I want to look at — , Aug 23 '15 at 06:49
@acidzombie24: oh yeah, I remember I thought of asking if you'd tried the other order. But then I left it out. Some warm-up time is normal, because of caches, and CPU power-saving. It takes several milliseconds of full-load to get the CPU to jump to max frequency. — Peter Cordes, Aug 23 '15 at 11:56
@acidzombie24: optimized asm locally: I like to use `gcc foo.c -Wall -O3 -ffast-math -march=native -fverbose-asm -S -o- | less`. godbolt is just useful for trying different compiler versions, and for making permalinks to post here. :) But yeah, C++ STL code results in asm that's too complex to wade through. That's why I just wanted to see what `operator<` compiled to, in the case where it didn't inline and couldn't take advantage of a compile-time constant. — Peter Cordes, Aug 23 '15 at 12:00

score 0 · Answer 2 · answered Aug 23 '15 at 04:25

0

Working with register sized variables is always fastest, otherwise you have to deal with potentially unaligned data and cropped operations.

I think it's a safe assumption that ideone is using a 64-bit compiler, so that explains that.

answered Aug 23 '15 at 04:25

Blindy

65,249
10
91
131

You mean 32bits because its slower in ideone case. Hmm, that could be it. So `>>32` is faster than using 2 32bit register? Seems a little odd still. I believe my VM is 64bits so you could be right – Aug 23 '15 at 04:29
It's not that odd, memory misalignment is a pretty big issue. That's why most C++ compilers have some form of `#pragma pack`. – Blindy Aug 23 '15 at 04:40
but using both structs as two 32bit values. There shouldnt be any alignment issue. Unless you're saying it aligns both 32bits into 2 different 64bit registers? but I don't see why it'd be slower because than my value would be 64bit and it should compare the same way. FYI to be clear A which is 2x32bits is slower than 1x64bits + bitshifting. – Aug 23 '15 at 04:46
You're treating the two 32bit values independently, thus requiring alignment to be done on one of them. Forget what it looks like to you, assembler only has `mov` (and co), and `mov` has rules. – Blindy Aug 23 '15 at 04:47
1

@Blindy: Actually, on x86-64, working with one 32bit variable is *slightly* faster in some cases than working with one 64bit variable. The default operand size is still 32bit for most instructions in long mode, so a `REX` prefix byte is needed to change `add %eax, %esi` to `add %rax, %rsi`. (If the instruction is using one of `%r8 - %r15`, then you already need a REX prefix, and you just set the `W` bit in the `REX` byte to make it a 64bit op.) – Peter Cordes Aug 23 '15 at 04:52

score 0 · Answer 3 · answered Aug 23 '15 at 05:14

Getting accurate high-resolution timings is very hard. The gettimeofday function is not guaranteed to be microsecond resolution and can be effected by many other system processes (ntpd, multi-threads, etc.). One thing that can get you closer is to use a tick-timer if you're on a machine that supports reading the clock register (like Intel processors).

Using your code and a tick-timer I noted that the fastest time moved from one structure to the other (I'm not running on a VM).

#include <cstdio>
#include <deque>
#include <algorithm>
#include <climits>
#include <sys/time.h>
#include <stdint.h>
using namespace std;

typedef unsigned int u32;
typedef unsigned long long u64;
static_assert(sizeof(u32)==4 && sizeof(u64)==8, "Fail");

#define CLOCK_TICKS 2400000000  // my machine clock is 2.4 Ghz

struct A {
    u32 first, second;
    A(int f, int s):first(f),second(s) {}
    bool operator<(u32 b) const { return first < b; }
};
//bool operator<(const A&a, u32 b) { return a.first < b; }

struct B {
    u64 v;
    B(int f, int s) { v = ((u64)f << 32) | s; }
    bool operator<(u32 b) const {
    return (v >> 32) < b;
    }
};
u32 get_value(deque<A>& d, u32 v) {
    auto p = lower_bound(d.begin(), d.end(), v);
    if (p != d.end())
    return p->second;
else
    return UINT_MAX;
}
u32 get_value(deque<B>& d, u32 v) {
    auto p = lower_bound(d.begin(), d.end(), v);
    if (p != d.end())
        return p->v & 0xFFFFFFFF;
    else
        return UINT_MAX;
}

// This function was originally published in the Intel hardware manual
// for the first chips that supported the SSE instruction set.  I've
// seen variations on this for 12+ years and do not know where to
// attribute its origin

 inline uint64_t rdtsc() {
    uint32_t lo, hi;
    __asm__ __volatile__ (
      "xorl %%eax, %%eax\n"
      "cpuid\n"
      "rdtsc\n"
      : "=a" (lo), "=d" (hi)
      :
      : "%ebx", "%ecx");
    return (uint64_t)hi << 32 | lo;
}

int main(int argc, char **argv)
{
    uint64_t s, e;
    double elapsed;
    s = rdtsc();   // warm up the timer
    {
        deque<A> d;

        s = rdtsc();
        for (int i = 0; i < 1024LL * 1024 * 1024 * 3 / 32; ++i)
            d.emplace_back(A(i, i ^ 92142));
        long v = 0;
        for (int i = 0; i < 10000; ++i)
            v += get_value(d, i * 3 + 1);

        e = rdtsc();
        elapsed = ((double)e - (double)s) / (double)CLOCK_TICKS;
        printf("A %ld\n", v);
        printf("Time: %lf\n", elapsed);
    }
    {
        deque<B> d;
        s = rdtsc();
        for (int i = 0; i < 1024LL * 1024 * 1024 * 3 / 32; ++i)
            d.emplace_back(B(i, i ^ 92142));
        long v = 0;
        for (int i = 0; i < 10000; ++i)
            v += get_value(d, i * 3 + 1);

        e = rdtsc();

        printf("B %ld\n", v);
        elapsed = ((double)e - (double)s) / (double)CLOCK_TICKS;
        printf("Time: %lf\n", elapsed);
    }   
}

you use [`clock_gettime`](http://stackoverflow.com/q/6749621/995714) in Linux and [`QueryPerformanceCounter`](https://msdn.microsoft.com/en-us/library/windows/desktop/ms644904%28v=vs.85%29.aspx) in Windows to get high resolution timer. No need to write your own — phuclv, Aug 23 '15 at 06:10

Why is struct{u64} faster than struct{u32,u32}?

3 Answers3