
I needed a way to initialize a scalar value from either a single float or three floating-point values (corresponding to RGB). So I just threw together a very simple struct:

struct Mono {
    float value;

    Mono() : value(0.0f) {}

    Mono(float value) : value(value) {}

    Mono(float red, float green, float blue) : value((red + green + blue) / 3.0f) {}
};

// Multiplication operator overloads:
Mono operator*(Mono const& lhs, Mono const& rhs) {
    return Mono(lhs.value * rhs.value);
}
Mono operator*(float const& lhs, Mono const& rhs) {
    return Mono(lhs * rhs.value);
}
Mono operator*(Mono const& lhs, float const& rhs) {
    return Mono(lhs.value * rhs);
}

This worked as expected, but then I wanted to benchmark whether this wrapper would impact performance at all, so I wrote the following test: it multiplies a float by the struct 100,000,000 times, and multiplies a float by a float 100,000,000 times:

#include <vector>
#include <chrono>
#include <iostream>

using namespace std::chrono;

int main() {
    size_t N = 100000000;

    std::vector<float> inputs(N);
    
    std::vector<Mono> outputs_c(N);
    std::vector<float> outputs_f(N);

    Mono color(3.24f);
    float color_f = 3.24f;

    for (size_t i = 0; i < N; i++){
        inputs[i] = i;
    }

    auto start_c = high_resolution_clock::now();
    for (size_t i = 0; i < N; i++){
        outputs_c[i] = color*inputs[i];
    }
    auto stop_c = high_resolution_clock::now();
    auto duration_c = duration_cast<microseconds>(stop_c - start_c);
    std::cout << "Mono*float duration: " << duration_c.count() << "\n";

    auto start_f = high_resolution_clock::now();
    for (size_t i = 0; i < N; i++){
        outputs_f[i] = color_f*inputs[i];
    }
    auto stop_f = high_resolution_clock::now();
    auto duration_f = duration_cast<microseconds>(stop_f - start_f);
    std::cout << "float*float duration: " << duration_f.count() << "\n";

    return 0;
}

When I compile it without any optimizations (`g++ test.cpp`), it prints the following times (in microseconds) very reliably:

Mono*float duration:  841122
float*float duration: 656197

So the Mono*float case is clearly slower there. But if I turn on optimizations (`g++ test.cpp -O3`), it prints the following times (in microseconds), again very reliably:

Mono*float duration:  75494
float*float duration: 86176

I'm assuming that something is getting optimized weirdly here and it is NOT actually faster to wrap a float in a struct like this... but I'm struggling to see what is going wrong with my test.

Chris Gnam
  • Numerous possibilities - compiler optimisers can reorder operations, omit operations as long as computed/output results (distinct from timing) are consistent with what the standard says. `high_resolution_clock` is not guaranteed to be good quality, and it is not unknown for some (poor) implementations to be non-monotonic i.e. time can seem to go backwards. You might try doing things in different order (e.g. do the `float*float` calcs first) and see what changes. Only way to be sure is to examine code emitted by your compiler. Also, consider using `steady_clock` instead. – Peter Feb 16 '23 at 03:16
  • @Peter yeah I tried reordering it, and changing up the operations, but the behavior persisted. I unfortunately lack the know-how to dig into what the compiler is actually doing, so for now I'm going to have to try it and benchmark in my actual full use-case to evaluate any change of performance there. – Chris Gnam Feb 16 '23 at 03:18

1 Answer


On my system (i7-6700k with GCC 12.2.1), whichever loop I do second runs slower, and the asm for the two loops is identical.

Perhaps that's because the cache is still partly primed from the `inputs[i] = i;` init loop when the first timed loop runs. (See https://blog.stuffedcow.net/2013/01/ivb-cache-replacement/ re: Intel's adaptive replacement policy for L3, which might explain some but not all of the entries surviving that big init loop. 100,000,000 floats is 400 MB per array, and my CPU has only 8 MiB of L3 cache.)

So, as expected from the low computational intensity (one vector math instruction per 16 bytes loaded + stored), it's just a cache / memory bandwidth benchmark, since you used one huge array instead of repeated passes over a smaller array. It has nothing to do with whether you have a bare `float` or a `struct { float value; }`.


As expected, both loops compile to the same asm (https://godbolt.org/z/7eTh4ojYf): a `movups` load, a `mulps` to multiply 4 floats, and a `movups` unaligned store. For some reason, GCC reloads the vector constant of 3.24 instead of hoisting it out of the loop, so it's doing 2 loads and 1 store per multiply. Cache misses on the big arrays should give out-of-order exec plenty of time to do those extra loads from the same `.rodata` address, which hits in L1d cache every time.

I tried How can I mitigate the impact of the Intel jcc erratum on gcc?, but it didn't make a difference; there's still about the same performance delta with `-Wa,-mbranches-within-32B-boundaries`. So, as expected, it's not a front-end bottleneck; IPC is plenty low. Maybe some quirk of cache.


On my system (Linux 6.1.8 on i7-6700k at 3.9 GHz, compiled with GCC 12.2.1 `-O3` without `-march=native` or `-ffast-math`), your whole program spends nearly half its time in the kernel's page-fault handler (comparing `perf stat` vs. `perf stat --all-user` cycle counts). That's not great if memory allocation and TLB misses aren't what you're trying to benchmark.

But that's total time, and you do touch the input and output arrays before the timed loops (`std::vector<float> outputs_f(N);` allocates and zeroes space for N elements, and the same goes for `outputs_c` via your custom struct's default constructor). So there shouldn't be page faults inside your timed regions, only potentially TLB misses. And of course lots of cache misses.


BTW, clang correctly optimizes away all the loops, because none of the results are ever used. `benchmark::DoNotOptimize(outputs_c[argc])` (from Google Benchmark) might help with that. Or some manual use of `asm` with dummy memory inputs / outputs to force the compiler to materialize arrays in memory and forget their contents.

See also Idiomatic way of performance evaluation?

Peter Cordes