
Based on "C-style Arrays vs std::vector using std::vector::at, std::vector::operator[], and iterators", I ran the following benchmarks.

From those results, vectors definitely perform better at -O3.

However, the C-style array is slower with -O3 than with -O0:

  • C-style (no opt) : about 2500

  • C-style (O3) : about 3000

I don't know what factors lead to this result. Maybe it's because the compiler is using C++14?

(I'm not asking about std::vector relative to plain arrays, I'm just asking about plain arrays with/without optimization.)
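
For reference, the benchmarks look roughly like the sketch below. This is a reconstruction pieced together from the function names and sizes discussed in the comments (`BM_map_c_array`, `vector_size` of 4096 ints, a 2000-entry `map`, sequential `map`/`out` with gathered loads from `in`); the index pattern and scaffolding are illustrative guesses, not the exact quick-bench code.

    #include <benchmark/benchmark.h>
    #include <cstddef>
    #include <vector>

    static const std::size_t vector_size = 4096;  // in/out: 4096 ints = 16 KiB each
    static const std::size_t map_size = 2000;     // map: 2000 * 8 B = ~16 kB of indices

    static void BM_map_c_array(benchmark::State& state) {
      int in[vector_size] = {};   // locals on the stack
      int out[vector_size] = {};
      std::size_t map[map_size];
      for (std::size_t i = 0; i < map_size; ++i)
        map[i] = (i * 7919) % vector_size;  // illustrative scattered-index pattern
      for (auto _ : state) {
        for (std::size_t i = 0; i < map_size; ++i)
          out[i] = in[map[i]];              // map/out sequential, in gathered
        benchmark::DoNotOptimize(&out[0]);  // make the result look used
        benchmark::ClobberMemory();         // force the stores to actually happen
      }
    }
    BENCHMARK(BM_map_c_array);

    static void BM_map_vector_v1(benchmark::State& state) {
      std::vector<int> in(vector_size), out(vector_size);
      std::vector<std::size_t> map(map_size);
      for (std::size_t i = 0; i < map_size; ++i)
        map[i] = (i * 7919) % vector_size;
      for (auto _ : state) {
        for (std::size_t i = 0; i < map_size; ++i)
          out.at(i) = in.at(map.at(i));     // .at() adds bounds-check instructions
        benchmark::DoNotOptimize(out.data());
        benchmark::ClobberMemory();
      }
    }
    BENCHMARK(BM_map_vector_v1);

    BENCHMARK_MAIN();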

    Don't link websites. Include the relevant content in the question. The readers are unlikely to visit them. – Rohan Bari Nov 22 '22 at 06:27
  • On that linked question, MooingDuck pointed out possible microbenchmark problems like page fault costs and CPU warmup. See [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) for more about that. Constructing and resizing a `std::vector` typically results in writing the memory, which will pre-fault it, so you're not paying the cost of those page-faults inside the timed region like you would when touching a big C array for the first time in a program. – Peter Cordes Nov 22 '22 at 06:35
  • If that's what's going on here, duplicate of [Why is iterating though \`std::vector\` faster than iterating though \`std::array\`?](https://stackoverflow.com/q/57125253) – Peter Cordes Nov 22 '22 at 06:35
  • Unclear what's going on. If you look at the asm on quickbench, `BM_map_vector_v1` and `v2` have more instructions since you used `.at()` for bounds-checking. Not super surprising if they're slower, although that might not be the reason. Your C array versions are using locals on the stack, and not so big that you'd expect whole pages of zero-backed memory. And you allocate arrays / vectors only once vs. writing many times. – Peter Cordes Nov 22 '22 at 06:55
  • `vector_size` of 4096 times sizeof(int) = 4 is only 16KiB so in+out just fills L1d cache. But `map` of 2000 * 8B = 16kB is another half of that. The map is read sequentially and output written sequentially, as you "gather" elements from `in`. At worst you'll be getting some L2 hits from `in`, and HW prefetch works for `out` and `map`. – Peter Cordes Nov 22 '22 at 06:59
  • Could be a code-alignment problem; I re-ran it on quick-bench after commenting `//static const std::size_t num_iterations = 1000000;` to see if it had been getting used at all (I don't think it was). I found more sensible results: everything the same (including vector_v3) except for the vector versions 1 and 2 using `.at()`, which puts so many insns in the loop it creates a front-end bottleneck. – Peter Cordes Nov 22 '22 at 07:04
  • As well as having the most insns, `BM_map_vector_v1` might have run into the JCC erratum ([How can I mitigate the impact of the Intel jcc erratum on gcc?](https://stackoverflow.com/q/61256646)) if QB runs on Skylake-X CPUs. QuickBench doesn't have ways to add GCC options like the necessary `-Wa,-mbranches-within-32B-boundaries`, unfortunately. – Peter Cordes Nov 22 '22 at 07:05
  • (Just realized I had been misinterpreting your question; you were only comparing C array -O0 vs. C array -O3, so across different QB runs with a different timing baseline. Posted an answer about that; I'll leave comments in case anyone's interested in that analysis of the -O3 results.) – Peter Cordes Nov 22 '22 at 07:52

2 Answers


Your -O0 code wasn't faster in an absolute sense; it only looks that way because quick-bench reports each benchmark's time as a ratio against an empty `for (auto _ : state) {}` loop.

That baseline loop also gets slower when optimization is disabled, because the `state` iterator functions don't inline. Check the asm for your own functions: instead of an outer-loop counter in `%rbx`, like

    # outer loop of your -O3 version
    sub    $0x1,%rbx
    jne    407f57 <BM_map_c_array(benchmark::State&)+0x37>

RBX was originally loaded from `0x10(%rdi)`, from the `benchmark::State& state` function arg.

You instead get state counter updates in memory, like the following, plus a bunch of convoluted code that materializes a boolean in a register and then tests it again.

    # part of the outer loop of your -O0 version
    12.50%   mov    -0x8060(%rbp),%rax
    25.00%   sub    $0x1,%rax
    12.50%   mov    %rax,-0x8060(%rbp)

The counts on those instructions are high because the `call map_c_array` didn't inline, so most of the CPU time wasn't actually spent in this outer function itself; but of the time that was, about half went to these counter updates. The same would be true of an empty loop, or of a loop calling an empty function (I'm not sure which Quick Bench uses as its baseline).


Quick Bench reports ratios like this to try to normalize the results for whatever hardware its cloud VM ends up running on, with whatever competing load. Click "About Quick Bench" in the dropdown at the top right.

And see the label on the graph: CPU time / Noop time. (When they say "Noop", they don't mean a `nop` machine instruction; they mean an empty loop in the C++ source.)


An empty loop with its loop counter runs about 6x slower when compiled with optimization disabled (bottlenecked on the store-to-load forwarding latency of the loop counter kept in memory). Your -O0 array code was "only" a bit less than 6x slower than the -O3 version, not exactly 6x, which is why its ratio against the noop baseline came out slightly smaller.

With a counter in a register, modern x86 CPUs can run loops at 1 cycle per iteration, e.g. `looptop: dec %ebx` / `jnz looptop`. `dec` on a register has one cycle of latency, vs. `sub` or `dec` on a memory location taking about 6 cycles, since that includes the store/reload. (See https://agner.org/optimize/ and https://uops.info/.)
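
If you want to see that store-forwarding bottleneck in isolation, here's a minimal sketch (my illustrative example, not quick-bench's actual harness). The `volatile` version imitates what -O0 does to every local by forcing the counter through memory; the empty inline-asm statement in the register version is just an optimization barrier so the loop doesn't get deleted at -O3.

    #include <benchmark/benchmark.h>

    // Counter kept in a register: compiles to roughly dec/jnz, ~1 cycle/iter.
    static void BM_counter_in_register(benchmark::State& state) {
      for (auto _ : state) {
        long i = 100000;
        while (i) {
          asm volatile("" : "+r"(i));  // opaque to the optimizer; keeps the loop
          --i;
        }
      }
    }
    BENCHMARK(BM_counter_in_register);

    // Counter forced through memory, like an -O0 local: each iteration does a
    // store + reload, bottlenecked on ~5-6 cycles of store-forwarding latency.
    static void BM_counter_in_memory(benchmark::State& state) {
      for (auto _ : state) {
        volatile long i = 100000;
        while (i) i = i - 1;
      }
    }
    BENCHMARK(BM_counter_in_memory);

    BENCHMARK_MAIN();

Built with optimization enabled, the memory version should come out several times slower per iteration, roughly mirroring the -O0 vs. -O3 baseline difference.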

With that bottleneck built into the baseline you're comparing against, it's normal that adding some array-access work inside the loop doesn't make the -O0 version look as many times slower than the empty loop as the same comparison does with optimized code.
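
Putting rough numbers on that (taking the ~6x baseline slowdown at face value; this is back-of-the-envelope, not a measurement):

    ratio_O0 = T_O0 / noop_O0 ≈ 2500
    ratio_O3 = T_O3 / noop_O3 ≈ 3000
    noop_O0  ≈ 6 × noop_O3
    =>  T_O0 / T_O3 ≈ (2500 × 6) / 3000 ≈ 5

So the -O3 build is really about 5x faster in absolute time, despite reporting the larger ratio.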

Peter Cordes

Because you aren't benchmarking what you think you're benchmarking. I bothered to look at your code, and found that you're trying to see how fast your CPU can advance the counter in a for loop while seeing how fast your data bus can transfer data. Is this really something you need to worry about, like, ever?

In general, benchmarks outside of multi-thousand-line programs are worthless and will never be taken seriously by anyone even remotely experienced in programming, so stop doing that.

yotsugi
  • You're right that they're not benchmarking what they think they are. It's a ratio of their loop against an empty loop, which also gets slower at `-O0`, so a smaller ratio doesn't mean faster in an absolute sense. – Peter Cordes Nov 22 '22 at 07:46
  • Microbenchmarks are meaningful if you do them right. Timing a gather loop is a semi-reasonable thing to do. The arrays are small so fit in L2 cache, almost in L1d, so the only bus involved is the one between L1d and L2 cache. So performance depends on memory-level parallelism, how many L1d misses can be in flight at once. (Also some L1d hits, out of order exec does have to hide that latency). Or with enough L1d hits in the random access, just on read and write bandwidth from `map` and the first part of `out`. – Peter Cordes Nov 22 '22 at 07:50
  • @PeterCordes no they are not, I didn't even bother reading your condescending comment because I will not give up `std::vector` just to malloc an array myself for some arbitrary and irrelevant numerical difference that's imperceptible in a real-world program that does something useful. You can find something to optimize everywhere, except you will never need to optimize something that, say, is called literally once in the entire program while more important things could be optimized instead. Whole program matters, its parts do not. – yotsugi Nov 22 '22 at 07:52
  • The actual question isn't asking about the performance ratio between C arrays and `std::vector` (with optimization disabled or enabled). I thought that at first, too, but they're actually asking about the C array `-O0` vs. C array `-O3` times as reported by quick-bench. Which doesn't report in milliseconds, instead as a *ratio* against an empty loop. Nowhere am I suggesting that anyone should give up `std::vector`. If you want to be wrong without reading comments that point out your mistake, that's up to you but I wouldn't recommend it. – Peter Cordes Nov 22 '22 at 07:56
  • @PeterCordes I'm not wrong, that's the issue here. And yeah, sure, good cope, on a C++ question. I gave an example that makes sense because I expected you to be sensibly intelligent and not an obtuse SO point collector that will argue about nothing and will benchmark nothing. The fact of the matter is that real programmers will use `std::vector` and will move on with their life while you continue collecting SO points over trivial and irrelevant mechanics of something that will never be useful. The only good answer is that "it doesn't matter why". It's like asking why shit smells, it just does. – yotsugi Nov 22 '22 at 17:23
  • IDK what you think the actual question was asking, but it's not whether they should use `std::vector` or not. (Their original motivation might have been to find out the cost of `std::vector`, just to verify that it's as cheap as C arrays, at least with optimization enabled. IDK, and if so this only covers one aspect of it since any overhead optimizes away out of the loop.) But anyway, that's not what they actually asked in their question, that's just background noise to what they did ask about quick bench and un-optimized code (which is also useless to benchmark). – Peter Cordes Nov 22 '22 at 17:32
  • @PeterCordes and I made this answer to point out that it's irrelevant what's faster. You should be optimizing programs, not data structures that were programmed decades ago and continually improved over time. I don't get how so many people can be so dense all at once. Your program performs slowly? Well then look at the data structures you're using and then check for better things. Arrays will never perform worse with O3 compared to O0, given that your program does real work that is absolutely necessary to do, ***you will only find differences in irrelevant benchmarks like this***. – yotsugi Nov 23 '22 at 10:36
  • Ok, so you're not interested in directly answering the question. (Which was about a misinterpretation of the results which made the OP think -O3 was slower, which of course would be surprising if it were true. But it's not, `-O3` *is* actually faster as expected, probably about 4x to 5x.) Posting that "microbenchmarks aren't useful" is not a useful answer to a benchmarking question when that's *all* you're saying, not addressing the actual question at all. At best that could be a comment. – Peter Cordes Nov 23 '22 at 10:52
  • It's true that you have to understand what you're doing (in terms of compilers, CPUs, and assembly) to learn anything meaningful from a small benchmark like this; they don't always tell you what will be faster as part of a larger program. But they're not always useless, and this one did correctly discover that using the `.at()` member function instead of `operator[]` had overhead that was significant for this tight loop, and nothing else mattered, not C array vs. std::vector. (When you look at the `-O3` benchmark, with the different functions.) – Peter Cordes Nov 23 '22 at 10:54
  • @PeterCordes "they're not always useless" yes they are, modern CPU's are complex, and so are modern programs, we aren't programming PDP11 anymore to do a micro benchmark and assume something in general case when writing programs. You will need to go ahead and benchmark the real program on real hardware to know whether it's faster or slower, because your microoptimization that's very cool and fast on paper, may actually turn out to make the overall program slower, for example inlining every single function ever. Also I can't comment because of geniuses like you who have nothing better to do. – yotsugi Nov 23 '22 at 10:56
  • Pitfalls like cache footprint and only measuring the hot-cache case are part of what I meant by not always telling you what's faster as part of a real program. Of course modern CPUs aren't simple, but some people have to understand them in order to keep making compilers better. (Or at least stopping them from regressing when their code-gen choices change.) That's why Intel has an asm optimization manual, and people like Agner Fog have spent time writing [a microarchitecture guide](https://agner.org/optimize/) for modern x86 CPUs. – Peter Cordes Nov 23 '22 at 11:27