
I've implemented two scan functions in C because I wanted to test Windows' QueryPerformance API. According to the benchmark data, lscan finishes in 0.000029 s while bscan finishes in 0.000016 s, though I cannot see why that is. I tested with an input source of 60157 characters and no optimization flags.

void bscan (char *source, size_t sourceLen)
{
    int i = 0;
    int b = sourceLen-1;
    int m = (int)((double)sourceLen/2.0);

    while(i < m)
    {
        if(source[i] == '.'); /* Do something */
        if(source[b] == '.'); /* Do something */

        i++;
        b--;
    }

    if(i==b)
    {
        if(source[i] == '.'); /* Do something */
    }
}

void lscan (char *source, size_t sourceLen)
{
    int i = 0;

    while(i < sourceLen)
    {
        if(source[i] == '.'); /* Do something */

        i++;
    }
}

Is bscan an actual technique that one can use to achieve such a performance gain, and if so, what makes it that much faster?


Revised code

int bscan (char *source, size_t sourceLen)
{
    int c = 0;
    int i = 0;
    int b = sourceLen-1;
    int m = sourceLen/2;

    while(i < m)
    {
        if(source[i] == '.') c++;
        if(source[b] == '.') c++;

        i++;
        b--;
    }

    if(i==b)
    {
        if(source[i] == '.') c++;
    }

    return c;
}

int lscan (char *source, size_t sourceLen)
{
    int i = 0, c = 0;

    while(i < sourceLen)
    {
        if(source[i] == '.') c++;

        i++;
    }

    return c;
}

Remark: The performance difference in the code here is smaller.

Edenia
  • 1
    Did you really test it as `if(source[b] == '.');`, or did you fill something in the `/* Do something */` comments when doing your benchmark? – Ryan Zhang Aug 14 '22 at 02:07
  • Edenia, Aside: Why the `double` cast in `(int)((double)sourceLen/2.0)` versus just `(int)(sourceLen/2)`? – chux - Reinstate Monica Aug 14 '22 at 02:08
  • @chux-ReinstateMonica No idea, I was testing something else before, so it's probably a leftover from that, haha. – Edenia Aug 14 '22 at 02:09
  • @RyanZhang I tested it with the empty ifs. No idea if compiler is supposed to optimize that without flags – Edenia Aug 14 '22 at 02:12
  • As it's currently written, it makes no sense to benchmark this code. The compiler is very well allowed to just turn both functions into `void func() { return; }` since you are doing nothing inside them that has an observable effect on the outside. – Marco Bonelli Aug 14 '22 at 02:13
  • Sidenote: I might rewrite the body of `lscan` as something like `for (size_t i = 0; i < sourceLen; i++) { if (source[i] == '.') /* something */; }` – Chris Aug 14 '22 at 02:14
  • 3
    The time difference between the two performances is so small and insignificant, it could just be noise. – Ryan Zhang Aug 14 '22 at 02:14
  • Which compiler are you using? We can try to godbolt-it to see what the compiler output is (and what's optimized away and what's not). – Ryan Zhang Aug 14 '22 at 02:15
  • @MarcoBonelli I think it does that with `-O3` – Edenia Aug 14 '22 at 02:17
  • @RyanZhang I just added simple counting of the dot character, still `bscan` is faster and is consistently being faster. Lower low-bound and lower average out of 100 runs at least. Also I am using mingw-gcc - gcc (Rev2, Built by MSYS2 project) 10.3.0 – Edenia Aug 14 '22 at 02:21
  • I'll take a look at the compiler output. Can you please edit your revised code into the question? – Ryan Zhang Aug 14 '22 at 02:24
  • @RyanZhang Done. The timing difference is smaller in the latter case but still consistent. I thought it shouldn't make a difference with the branch prediction in the slightest. – Edenia Aug 14 '22 at 02:30
  • I'm looking at the compiler output of this code on `x86 msvc v19`, and there's nothing that should make `bscan` faster. I don't know about mingw-gcc, though. – Ryan Zhang Aug 14 '22 at 02:34
  • `if(source[i] == '.'); c++;`. You should delete the `;` before `c++;`. – Ryan Zhang Aug 14 '22 at 02:36
  • Oups, yes, sorry. Bit difficult to edit in-place. – Edenia Aug 14 '22 at 02:37

2 Answers


TL:DR: unrolling with two dependency chains (i++ and b--) gets twice as much work done within the same ~6-cycle store-forwarding bottleneck that a non-optimized build creates.

Your result is explainable given known compiler and CPU behaviour, but tells you nothing about how it would perform for real, with optimization enabled. (Or with non-empty if() bodies that might branch mispredict if not compiled to branchless increments).

It also has nothing to do with theoretical big-O complexity; these are purely engineering / micro-optimization effects in the anti-optimized asm. Counting the number of operations isn't a useful way to predict performance, except sometimes to within an order of magnitude. (Or two orders, if a difference in cache misses is expected.)


You compiled without optimization, so it's not a realistic benchmark that tells you anything about what's normally efficient in C. See Idiomatic way of performance evaluation? and C loop optimization help for final assignment (with compiler optimization disabled) re: why this is not a useful way to microbenchmark, and the weird distortions it introduces.

The default (-O0 for GCC/clang, but not ICC) is basically anti-optimization, not keeping variables in registers across C statements, which introduces bottlenecks that wouldn't exist in code built with normal options like -O2 or -O3 -march=native.
Specifically, store-to-load forwarding latency creating long dependency chains through the integer loop-counters. (About 6 cycle latency for load/add/store vs. 1 cycle for increment a register, on recent Intel and AMD. https://uops.info/ and https://agner.org/optimize/. Except for Zen 2 and Ice Lake with their memory renaming that can store-forward with 0 extra latency in cases like this. On those CPUs, I'd expect both loops to have similar performance). You didn't say what CPU you tested on, but I'm guessing some kind of modern x86. Things would be pretty much the same on modern CPUs of other ISAs, at least ones with out-of-order exec so they can overlap independent work across iterations.

Optimizing for this way of compiling often complicates your source for no actual benefit, or at best means you use register int i.

Your bidirectional loop effectively unrolls with two separate dependency chains, splitting the length in two. That 6-cycle-per-iteration bottleneck gives modern out-of-order exec CPUs plenty of time to keep up with the throughput of all the per-iteration work that depends on reading each i++ (and b--) result, even in the loop that does both: 6 cycles is a long time for a CPU that can issue 4 instructions per clock.

So I expect cycles per iteration is about the same for both loops, just bottlenecked on store-forwarding latency. But the bidirectional loop only has to run half the number of iterations.
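The unrolling effect doesn't actually require scanning from both ends; a forward loop with two independent chains gets the same overlap. A sketch of that idea (count_dots_unrolled is a made-up name, not from the question):

```c
#include <stddef.h>

/* Sketch: two independent counter chains (c0 and c1) over even/odd
   elements. An out-of-order CPU can overlap the two chains the same
   way it overlaps the i++ and b-- chains in the bidirectional loop. */
static size_t count_dots_unrolled(const char *source, size_t sourceLen)
{
    size_t c0 = 0, c1 = 0;
    size_t i = 0;
    for (; i + 2 <= sourceLen; i += 2) {
        c0 += (source[i]     == '.');  /* chain 0 */
        c1 += (source[i + 1] == '.');  /* chain 1 */
    }
    if (i < sourceLen)                  /* odd-length tail */
        c0 += (source[i] == '.');
    return c0 + c1;
}
```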

In fact, the loop with more work might actually run faster per iteration, due to some interesting effects on Sandybridge-family CPUs where not trying to reload too soon can actually get store-forwarding to work with less latency. See Adding a redundant assignment speeds up code when compiled without optimization. So it could be more than twice as fast in a debug build.

Of course, if you actually want to count matches, these if() c++ statements hopefully get vectorized at -O3 into SIMD compares. Or do it manually with AVX2 or SSE2 intrinsics: How to count character occurrences using SIMD just bottlenecks on memory bandwidth unless the data is already hot in L1d or maybe L2 cache.
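A sketch of the SSE2 approach (assuming an x86-64 target; count_dots_sse2 is a made-up name, not from the linked Q&A): compare 16 bytes at a time against '.', accumulate the compare masks as byte counters, and periodically widen with psadbw before a byte could wrap.

```c
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 */

static size_t count_dots_sse2(const char *source, size_t sourceLen)
{
    const __m128i dot = _mm_set1_epi8('.');
    __m128i totals = _mm_setzero_si128();
    size_t i = 0, c = 0;

    while (i + 16 <= sourceLen) {
        __m128i acc = _mm_setzero_si128();
        size_t chunk = 0;
        /* Each matching byte compares to 0xFF (-1), so subtracting the
           mask increments that byte counter. Stop after 255 vectors,
           before any byte counter could overflow. */
        while (i + 16 <= sourceLen && chunk < 255) {
            __m128i v = _mm_loadu_si128((const __m128i *)(source + i));
            acc = _mm_sub_epi8(acc, _mm_cmpeq_epi8(v, dot));
            i += 16;
            chunk++;
        }
        /* psadbw: horizontal sum of 16 byte counters into two u64 lanes */
        totals = _mm_add_epi64(totals, _mm_sad_epu8(acc, _mm_setzero_si128()));
    }
    c = (size_t)(_mm_cvtsi128_si64(totals)
               + _mm_cvtsi128_si64(_mm_unpackhi_epi64(totals, totals)));

    for (; i < sourceLen; i++)  /* scalar tail for the last <16 bytes */
        c += (source[i] == '.');
    return c;
}
```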

In the version of the code you measured, the if bodies are empty, so there's no actual branch in the asm for the CPU to mispredict even if there's a mix of '.' and other characters. In the version with a counter, if the characters are all '.', then an unoptimized build of the bidirectional version will slow down to the same speed as the other version, bottlenecked on the latency of incrementing c twice per loop iteration.
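For what it's worth, writing the count branchlessly sidesteps the mispredict question entirely (a sketch; count_dots is a made-up name): the 0/1 comparison result is added directly, so compilers typically emit a cmp/sete/add sequence with no conditional branch on the data.

```c
#include <stddef.h>

/* Branchless counting: (source[i] == '.') evaluates to 0 or 1, which is
   added unconditionally, so there is no data-dependent branch to
   mispredict regardless of the mix of characters. */
static size_t count_dots(const char *source, size_t sourceLen)
{
    size_t c = 0;
    for (size_t i = 0; i < sourceLen; i++)
        c += (source[i] == '.');
    return c;
}
```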

I'm assuming that branch mispredictions either weren't significant, hurt both equally, or didn't exist because you measured the code you claimed you did, which had empty if() bodies.

GCC 10.x -O0 -mabi=ms on Godbolt should produce code-gen similar to what you got from MinGW GCC 10.3.0. Note that there are no asm instructions corresponding to the if(source[i] == '.'); source lines, just the loop-counter increment and compare logic. So fortunately that removes all memory accesses to the char array, eliminating any possible page-fault overhead in case you didn't read or write to it before the timed region.


With optimization enabled, unrolling with multiple accumulators is usually only important for floating-point stuff like summing an array or a dot product, since FP math instructions typically have higher latency, and integer math is associative so compilers can just unroll loop counters for you. But artificially creating latency bottlenecks for integer loop counters creates the same effects as in questions like Latency bounds and throughput bounds for processors for operations that must occur in sequence.
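For the FP case that paragraph describes, manual unrolling with multiple accumulators looks roughly like this (a sketch; sum_two_accs is a made-up name):

```c
#include <stddef.h>

/* Sketch: summing with two accumulators hides FP-add latency. The two
   addition chains are independent, so they can execute in parallel.
   Without -ffast-math the compiler can't make this transformation
   itself, because FP addition is not associative. */
static double sum_two_accs(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        s0 += a[i];      /* chain 0 */
        s1 += a[i + 1];  /* chain 1 */
    }
    if (i < n)           /* odd-length tail */
        s0 += a[i];
    return s0 + s1;
}
```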


Peter Cordes
  • That's some serious amount of knowledge on the subject. I suspected some of the statements are so, but I am unfortunately lacking knowledge in ASM so I have no idea on the details. I did not know that `-O0` doesn't even cache variables closer to CPU. I am testing on a tiger lake i7-11800h mobile processor – Edenia Aug 14 '22 at 12:27
  • @Edenia: `-O0` doesn't disable the *CPU cache*, you're not going to RAM and back. And not even waiting for the store buffer to commit stores back into L1d cache, instead forwarding straight from the store buffer. That's why it's called store-to-load *forwarding*. `-O0` does disable optimizing variables into registers across statements, which is a big deal, but your data does stay inside the CPU pipeline itself. Each CPU core has its own L1d and L2 cache so those are both part of the core, especially L1d, but you could say not part of the pipeline. – Peter Cordes Aug 14 '22 at 12:30
  • Yes. I also thought that complexity should be a performance indicator whenever there is no load, and that without optimization the code essentially shouldn't be omitted completely, but that was just a guess. – Edenia Aug 14 '22 at 12:33
  • Thanks for the details on CPU model number. Maybe Intel only had the 0-latency store-forwarding speedup in Ice Lake, but not Tiger Lake, or disabled it with a microcode update? I'd have expected it to work for this. https://gist.github.com/travisdowns/bc9af3a0aaca5399824cf182f85b9e9c says it is present on IceLake. – Peter Cordes Aug 14 '22 at 12:33
  • Later I can test it on an A6 7400k, the result might be interesting on that one. – Edenia Aug 14 '22 at 12:38
  • 1
    Possibly. Bulldozer-family is not as wide a CPU, but store-forwarding latency should still easily allow enough time for the otherwise-empty loop with no array access to do its 2 counter updates in the shadow of the store-forwarding latency. I wrote this answer assuming a CPU other than Ice Lake or Zen 2, where store-forwarding has about 5 cycle latency. Bulldozer's less sophisticated branch prediction might make it suffer more in the version with `counter++` as the if bodies, if that's not a simple pattern in your string. But sure, play around with it if you're curious. Also try `gcc -O3` – Peter Cordes Aug 14 '22 at 12:43

Your bscan() will perform about half as many index comparisons (i < m and i == b) as your lscan() will (i < sourceLen) on the same input. Accordingly, bscan() will also perform fewer branches. All other operations appear to be the same. Your routines do very little else, so it is entirely plausible that these loop-control operations contribute enough to the overall execution time to explain the performance difference you see.

That's also consistent with the observation that adding work in the loop body reduces the (relative, I assume) performance difference.

John Bollinger
  • 2
    Latency of the dependency chains is usually the bottleneck in un-optimized code (GCC defaults to `-O0`, [which won't keep any variables in registers across C statements](https://stackoverflow.com/questions/53366394/why-does-clang-produce-inefficient-asm-with-o0-for-this-simple-floating-point)). Number of branches is only indirectly a factor: the loop branch should predict near-perfectly for long-running loops. What matters is that you unroll with two dependency chains (`i++` and `b--`), so cycles per iteration is probably about the same, but you get twice as much work done per iteration. – Peter Cordes Aug 14 '22 at 02:58
  • 1
    TL:DR: Counting number of operations is not the right approach for modern out-of-order exec CPUs, especially in an un-optimized build that introduces major bottlenecks. [C loop optimization help for final assignment (with compiler optimization disabled)](https://stackoverflow.com/q/32000917) / [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) – Peter Cordes Aug 14 '22 at 03:01
  • Wrote up an answer. We've had a few questions I remember where a canonical answer about how loop-counter dependency chains become significant bottlenecks in un-optimized builds would have been useful, so hopefully it was worth explaining the details somewhere. – Peter Cordes Aug 14 '22 at 03:48