Why is std::count_if faster or slower than an explicit loop depending on optimization settings?

Question

I have 2 different pieces of code but resulting in the same thing.

The first one is using the iterative method:

const std::vector<int> vec {1,8,2,96,47,23};
int sum {};
for(const auto& i: vec)
   sum += (i % 2) ? 1 : 0;

And the second piece of code is the following:

const std::vector<int> vec {1,8,2,96,47,23};
const int sum = std::count_if(vec.begin(), vec.end(), [](const int& i){return i % 2 != 0;});

My question is why when using the -O1 optimization the iterative method wins and when using -O3 or -Ofast the algorithmic method wins.

I expected that the algorithmic method will win in either cases.

Can someone explain?

what is the meaning of "win" ? Do you measure the runtime? How do you do that? — 463035818_is_not_an_ai, Nov 05 '21 at 08:14
`std::count_if` will still be an "iterative method" as it can't function without a loop inside. — Some programmer dude, Nov 05 '21 at 08:15
please show the code you used to measure the runtime and also include the results in the question — 463035818_is_not_an_ai, Nov 05 '21 at 08:17
yes I know, but what I meant is using the ready methods than creating our iteration with the range-based for loop or other — Bol, Nov 05 '21 at 08:18
small side note: `sum += (i % 2) ? 1 : 0;` is equal to `sum += (i % 2);` — justANewb stands with Ukraine, Nov 05 '21 at 08:20
`gcc -O1` doesn't include `-finline-functions`, so STL template functions and lambdas turn into a rats nest of function calls. I'd expect both ways to compile to the same asm with full optimization; are you sure that wasn't a benchmarking problem, like optimizing away or CPU warm-up? ([Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987)) 5 elements is a tiny problem, and if it's const it may all optimize away. — Peter Cordes, Nov 05 '21 at 08:20
@justANewbie: Not exactly, `i%2 == (i%2 != 0)` is only true for unsigned. For signed `int`, `i % 2` can be -1, 0, or +1, and without constant-propagation for the compiler to know that all the values are non-negative, it would have to actually implement that logic with extra asm instructions. `sum += (i%2 != 0)` or the ternary can actually optimize to `sum += i & 1;` on 2's complement machines, i.e. every modern ISA. — Peter Cordes, Nov 05 '21 at 08:22
@Someprogrammerdude, then my question is why if std::count_if still an iterative method, I got different results in benchmarking? — Bol, Nov 05 '21 at 08:22
@PeterCordes Yes, you're right. I automatically assume `i` is positive. My bad. — justANewb stands with Ukraine, Nov 05 '21 at 08:24
Not all ways of iterating optimize to the same machine code. @Someprogrammerdude is pointing out a word-choice problem in your description, not answering the actual performance question. (That problem is fixed now, thanks to mkrieger1's edit, leaving just the performance question. You didn't say what compiler you used, or show the benchmark code you used, though.) — Peter Cordes, Nov 05 '21 at 08:24
GCC 11.2 with `-O3` optimization [compiles both to the same assembly code](https://godbolt.org/z/j11Yn6nY9). — Some programmer dude, Nov 05 '21 at 08:29
@PeterCordes, I'm using Clang 12.0 and sometimes GCC 10.3. and I'm using the Quick C++ Benchmark — Bol, Nov 05 '21 at 08:35
@Someprogrammerdude, how is that? it compiles the same assembly code? I see for the ```-O0``` 217 lines and for the ```-O3``` 76 lines of assembly code — Bol, Nov 05 '21 at 08:38
@Bol: Nobody's saying -O0 and -O3 compile the same. Only that with only `-O1`, the two versions of the source very likely compile to different asm. But at -O3 the two source versions optimize to the same asm. Look at the `sum1` and `sum2` functions in the right-hand compiler pane, the `-O3` one. The `-O0` one is useless, IDK why Someprogrammerdude included that, since you used at least `-O1` for your benchmark. — Peter Cordes, Nov 05 '21 at 08:42
https://godbolt.org/z/vqGq1757d shows that clang 12.0 -O1 compiles both to garbage, not inlining a bunch of functions. https://godbolt.org/z/W7fabczzz shows that clang 13.0 -O1 does inline the std::vector accessor functions in the range-for loop. — Peter Cordes, Nov 05 '21 at 08:47
@PeterCordes I got this. can you take a look https://quick-bench.com/q/8q7CKBmR-G5czjsTY5qFPNRwTJg — Bol, Nov 05 '21 at 08:53
Interesting. If you look at the asm, you can see the `count_if` version auto-vectorized the and / add for 4 of the 6 elements, but the range-for didn't for some reason, with clang 12.0 -O3 using libstdc++ headers. Since the `std::vector` allocation and init are outside the benchmark loops (something you didn't show in your SO question!) that has a big effect on the total time. — Peter Cordes, Nov 05 '21 at 08:59
Oh, they're doing different things. The `count_if` version is doing `sum = ...` ; `benchmark::DoNotOptimize(sum);`, but the range-for version is doing `sum += ... ` ; `DoNotOptimize(sum)`. This is forcing a store/reload of it inside the loop for some reason, so that store-forwarding latency is part of the critical path latency bottleneck. The count_if version even has `const int sum`! Changing that to `sum += count_if` makes it slower, but still not as slow as the range-for version: https://quick-bench.com/q/LqIbLoR0ynsGal5dx8XC7ZVVJCk — Peter Cordes, Nov 05 '21 at 09:01
If you edit your actual question to reflect what you're really doing, I could post those comments as an answer. — Peter Cordes, Nov 05 '21 at 09:06
@PeterCordes, but I'm looking for the algo version to be more faster no? — Bol, Nov 05 '21 at 09:09
The `count_if` version *is* still faster, i.e. not as slow as the range-for version, with that combination of compiler, options, and STL header library. Basically by chance, though, I think; normally loops over std::vector can auto-vectorize with SIMD. — Peter Cordes, Nov 05 '21 at 09:15

Why is std::count_if faster or slower than an explicit loop depending on optimization settings?

0 Answers0