
Just a little C++ code; I confirmed the same behavior in Java.

Here is example code that reproduces this behavior, compiled with Visual Studio 2019 in Release x64. I got:

611 ms for just incrementing each element.

631 ms for incrementing each element through the cache, so an additional 20 ms of overhead.

But when I add a heavy operation before each increment (I chose random number generation), I get:

2073 ms for just incrementing each element.

1432 ms for incrementing each element through the cache.

I have an Intel 10700K CPU and 3200 MHz RAM, if it matters.

#include <iostream>
#include <random>
#include <chrono>
#include <cstdlib>


#define ARR_SIZE (256 * 256 * 256)  // parenthesized so the macros expand safely
#define ACCESS_SIZE (256 * 256)
#define CACHE_SIZE 1024
#define ITERATIONS 1000

using namespace std;
using chrono::high_resolution_clock;
using chrono::duration_cast;
using chrono::milliseconds;

int* arr;
int* cache;
int counter = 0;

// Apply all buffered increments to arr, then reset the buffer.
void flushCache() {
    for (int j = 0; j < CACHE_SIZE; ++j)
    {
        ++arr[cache[j]];
    }
    counter = 0;
}

// Buffer the index; once CACHE_SIZE indices are pending, apply them all.
void incWithCache(int i) {
    cache[counter] = i;
    ++counter;
    if (counter == CACHE_SIZE) {
        flushCache();
    }
}

void incWithoutCache(int i) {
    ++arr[i];
}

// Stand-in for an expensive operation between increments.
int heavyOp() {
    return rand() % 107;
}

int main()
{
    arr = new int[ARR_SIZE];
    cache = new int[CACHE_SIZE];
    int* access = new int[ACCESS_SIZE];

    random_device rd;
    mt19937 gen(rd());

    for (int i = 0; i < ACCESS_SIZE; ++i) {
        access[i] = gen() % (ARR_SIZE);
    }
    for (int i = 0; i < ARR_SIZE; ++i) {
        arr[i] = 0;
    }


    auto t1 = high_resolution_clock::now();
    for (int iter = 0; iter < ITERATIONS; ++iter) {
        for (int i = 0; i < ACCESS_SIZE; ++i) {
            incWithoutCache(access[i]);
        }
    }
    auto t2 = high_resolution_clock::now();
    auto ms_int = duration_cast<milliseconds>(t2 - t1);
    cout << "Time without cache " << ms_int.count() << "ms\n";

    t1 = high_resolution_clock::now();
    for (int iter = 0; iter < ITERATIONS; ++iter) {
        for (int i = 0; i < ACCESS_SIZE; ++i) {
            incWithCache(access[i]);
        }
        flushCache();
    }
    t2 = high_resolution_clock::now();
    ms_int = duration_cast<milliseconds>(t2 - t1);
    cout << "Time with cache " << ms_int.count() << "ms\n";


    t1 = high_resolution_clock::now();
    for (int iter = 0; iter < ITERATIONS; ++iter) {
        for (int i = 0; i < ACCESS_SIZE; ++i) {
            heavyOp();
            incWithoutCache(access[i]);
        }
    }
    t2 = high_resolution_clock::now();
    ms_int = duration_cast<milliseconds>(t2 - t1);
    cout << "Time without cache and time between " << ms_int.count() << "ms\n";

    t1 = high_resolution_clock::now();
    for (int iter = 0; iter < ITERATIONS; ++iter) {
        for (int i = 0; i < ACCESS_SIZE; ++i) {
            heavyOp();
            incWithCache(access[i]);
        }
        flushCache();
    }
    t2 = high_resolution_clock::now();
    ms_int = duration_cast<milliseconds>(t2 - t1);
    cout << "Time with cache and time between " << ms_int.count() << "ms\n";
}

  • so. What's your question? – tofro Mar 18 '21 at 23:37
  • @tofro why it happens – Grey Mar 18 '21 at 23:41
  • Those two operations are not very similar? What are they supposed to do and how are they supposed to be related? And what is your command line for compiling? – Yakk - Adam Nevraumont Mar 18 '21 at 23:46
  • Without the generated assembly code we can only guess. I'd check where that differs. `heavyOp` obviously somehow has operations that invalidate the CPU cache. – tofro Mar 18 '21 at 23:50
  • @Yakk-AdamNevraumont Given a big array, I need to increment given elements. One function increments immediately; the second stores them in a cache and increments them when the cache is filled. I compiled it with Visual Studio, so I don't know the command line, just Release and x64. – Grey Mar 18 '21 at 23:54
  • @tofro Seems like it's connected with the cache. I changed the elements to be incremented from random to 0, 64, 128, 192, etc., and got this effect; with a smaller stride it doesn't appear. But I still don't know why the difference is so big. – Grey Mar 19 '21 at 00:08
  • BTW, the only thing stopping your code from compiling with `g++` 10.2 on GNU/Linux is that it's an error for `main` to be `void` instead of `int`. :/ After fixing that, I can't repro your result: without batched updates is faster both ways, on i7-6700k (with 8MiB of L3 cache). Like 370ms vs. 457ms for "fast", 688ms vs. 736ms for "slow" (after running a couple times to warm up CPU frequency). So 1. clearly glibc rand is less slow than MSVCs, and 2. the extra cost for batching updates in your "cache" adds nearly linearly. – Peter Cordes Mar 19 '21 at 15:36
  • Note that since you don't use the result of `heavyOp`, everything optimizes away except the call to `rand()`. Note that `% 107` isn't slow anyway, just a multiply and shift. – Peter Cordes Mar 19 '21 at 15:42
  • Your global pointers are giving the compiler a hard time; it has to assume that any write to `cache[i]` or read of `access[i]` might be the same memory as `counter`, because they're both `int`. https://godbolt.org/z/ss6PK8 runs significantly faster with G++ -O3 - using `int cache[CACHE_SIZE];` and so on so the compiler knows they don't overlap and can keep things in registers, at least when it's not calling functions like rand. That would mix things up for MSVC, too - would be interesting to see if results are similar but faster, or qualitatively different. – Peter Cordes Mar 19 '21 at 16:56
  • @PeterCordes Correct, the results are reproducible only with VC++ and only in the presence of a call to `rand()`. I made the global pointers `const`, which [helped](https://gcc.godbolt.org/z/Wqeczz) with the codegen (edit: it didn't), but runs 3 and 4 still take 3s and 2s respectively, and run 3 is still reported as heavily front-end bound. – rustyx Mar 19 '21 at 17:01
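
As an aside on the dead-code point in the comments above, here is a minimal standalone sketch (an illustration, not the OP's benchmark; `sink` is a name introduced here) of keeping `heavyOp`'s work alive by feeding its result into a printed value:

#include <cstdlib>
#include <iostream>

// Stand-in for the OP's heavyOp().
static int heavyOp() { return std::rand() % 107; }

int main() {
    long long sink = 0;
    for (int i = 0; i < 1000000; ++i) {
        sink += heavyOp();  // the result now feeds a visible output below
    }
    std::cout << "sink = " << sink << "\n";  // forces the % 107 work to be kept
    return 0;
}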

1 Answer


I think these kinds of questions are extremely hard to answer - optimizing compilers, instruction reordering, and caching all make this hard to analyze - but I do have a hypothesis.

First, the difference between `incWithoutCache` and `incWithCache` without `heavyOp` seems reasonable - the second one is simply doing more work.

Where it gets interesting is when you introduce `heavyOp`.

`heavyOp` + `incWithoutCache`: `incWithoutCache` requires a fetch from memory before it can write to `arr`. When that memory fetch completes, it can do the addition. Because of pipelining, the processor may begin the next `heavyOp` before the increment is complete.

`heavyOp` + `incWithCache`: `incWithCache` does not require a fetch from memory in every iteration, as it usually only has to write out a value. The processor can queue that write to a memory controller and continue. It does do `++counter`, but that always accesses the same value, so I assume it can be cached better than `++arr[i]` in `incWithoutCache`. When the cache is flushed, the flushing loop can be heavily pipelined - its iterations are (almost always) independent of each other, so many of them can be in flight at a time.

So I think the big difference here is that the actual writes to `arr` cannot be pipelined as efficiently without the cache, because `heavyOp` is trashing your pipeline and potentially your cache. Your `heavyOp` takes the same amount of time in either case, but in `heavyOp` + `incWithoutCache` the amortized cost of a write to `arr` is higher, because it is not overlapped with other writes to `arr` as it can be in `heavyOp` + `incWithCache`.
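
That overlap effect can be seen in isolation with a small self-contained sketch (a generic illustration of memory-level parallelism, not the OP's loops): a stream of independent random loads, whose cache misses can all be in flight at once, versus a pointer chase where each address depends on the previous load and therefore serializes:

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const std::size_t N = 1u << 24;  // 16M elements, ~64 MiB: well past L3

    // Build a single-cycle random permutation (Sattolo's algorithm) so the
    // dependent chase below visits all N elements before repeating.
    std::vector<std::uint32_t> next(N);
    std::iota(next.begin(), next.end(), 0u);
    std::mt19937 gen(42);
    for (std::size_t i = N - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> d(0, i - 1);
        std::swap(next[i], next[d(gen)]);
    }

    // A separate stream of random indices for the independent loop.
    std::vector<std::uint32_t> order(next);
    std::shuffle(order.begin(), order.end(), gen);

    using clk = std::chrono::steady_clock;

    // Independent loads: every address is known up front, so the CPU can
    // keep many cache misses in flight at once.
    auto t0 = clk::now();
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < N; ++i) sum += next[order[i]];
    auto t1 = clk::now();

    // Dependent loads: each address comes from the previous load, so the
    // misses serialize and every step pays the full memory latency.
    std::uint32_t cur = 0;
    for (std::size_t i = 0; i < N; ++i) cur = next[cur];
    auto t2 = clk::now();

    auto ms = [](auto d) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };
    std::cout << "independent: " << ms(t1 - t0) << "ms, dependent: "
              << ms(t2 - t1) << "ms (sink " << (sum + cur) << ")\n";
}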

I thought vectorization could theoretically be used for the flushing operation, but I did not see it on Compiler Explorer, so it is probably not the cause of the discrepancy. If vectorization were being used, it could explain a speed difference like this.
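
One concrete obstacle to vectorizing the flush loop (see also the first comment below): duplicate indices within one batch. A simplified scalar model of a 4-lane gather/add/scatter (plain loops standing in for real intrinsics) shows how an increment gets lost:

#include <iostream>

int main() {
    int cache[4] = {5, 2, 5, 7};  // index 5 appears twice in one "vector"

    // Scalar flush: arr[5] is incremented twice, as intended.
    int scalar[8] = {0};
    for (int j = 0; j < 4; ++j) ++scalar[cache[j]];

    // Naive 4-lane gather/add/scatter: both lanes holding index 5 gather
    // the same old value, so the second scatter overwrites the first and
    // one increment is lost.
    int simd[8] = {0};
    int gathered[4];
    for (int j = 0; j < 4; ++j) gathered[j] = simd[cache[j]] + 1;  // gather + add
    for (int j = 0; j < 4; ++j) simd[cache[j]] = gathered[j];      // scatter

    std::cout << "scalar arr[5] = " << scalar[5]
              << ", naive-SIMD arr[5] = " << simd[5] << "\n";  // prints 2 vs 1
}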

I will say I am not an expert in this and could easily be completely wrong about all of this... but it makes sense to me.

Dean Johnson
  • *I think vectorization could theoretically be used for the flushing operation* only with AVX-512 gather/scatter, with conflict detection in case the same element needs to be incremented more than once in the same vector of indices. Note that it's batching up a bunch of array *indices* to be incremented, i.e. a histogram, not something SIMD does well. – Peter Cordes Mar 19 '21 at 15:48
  • Certainly memory-level parallelism and overlapping work with execution of `rand()` is a factor, but function calls don't stall the pipeline; a memory-destination increment can be in flight while `rand()` is calculating. You'd expect the simple way to be faster because it can overlap the work of doing scattered increments with the work of calling `rand` repeatedly. Unless reloading more global vars around calls to rand leads to more total work for the CPU to chew through. (On Linux with GCC, the no-batching way is faster with and without rand, see my comment on the question). – Peter Cordes Mar 19 '21 at 16:13
  • Note that reloading something you just stored doesn't just come from the cache, it can even be forwarded from the [*store buffer*](https://stackoverflow.com/questions/64141366/can-a-speculatively-executed-cpu-branch-contain-opcodes-that-access-ram) if the store hasn't committed to cache yet. On Sandybridge-family, this has somewhat variable latency - [Adding a redundant assignment speeds up code when compiled without optimization](https://stackoverflow.com/q/49189685). Not sure if rand() is fast enough to bottleneck on SF latency; probably not. – Peter Cordes Mar 19 '21 at 16:16
  • Hmm, I wonder if storing through the `int *array` might make the compiler think it needs to reload other global `int` vars, because the store might alias something else. MSVC is like `g++ -fno-strict-aliasing` so it might even be doing more reloads. – Peter Cordes Mar 19 '21 at 16:18
  • One way to test that would be to use `int array[size]` and so on, instead of pointers. – Peter Cordes Mar 19 '21 at 16:25
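
A sketch of the test suggested in the last two comments (assumed, not verified against MSVC): with global arrays instead of global `int*` pointers, a store through `cache` provably cannot alias `counter`, so the compiler is free to keep `counter` in a register across the hot loop:

#include <iostream>

// Global arrays instead of global int* pointers; sizes reuse the
// question's constants. arr lands in zero-initialized static storage.
#define ARR_SIZE (256 * 256 * 256)
#define CACHE_SIZE 1024

static int arr[ARR_SIZE];
static int cache[CACHE_SIZE];
static int counter = 0;

static void flushCache() {
    for (int j = 0; j < CACHE_SIZE; ++j)
        ++arr[cache[j]];
    counter = 0;
}

static void incWithCache(int i) {
    cache[counter] = i;  // cannot be a store to counter itself now
    if (++counter == CACHE_SIZE)
        flushCache();
}

int main() {
    for (int i = 0; i < 10 * CACHE_SIZE; ++i)
        incWithCache((i * 64) % ARR_SIZE);
    std::cout << arr[0] << "\n";  // use the result so nothing is optimized away
}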