
If we have an array of integer pointers that all point to the same int, and we loop over it applying the ++ operation, it is 100% slower (twice as slow) than when the pointers point to two different ints. Here is a concrete example:

int* data[2];
int a, b;
a = b = 0;
for (auto i = 0ul; i < 2; ++i) {
    // Case 3: 2.5 sec
    data[i] = &a;

    // Case 2: 1.25 sec
    // if (i & 1)
    //     data[i] = &a;
    // else
    //     data[i] = &b;
}

for (auto i = 0ul; i < 1000000000; ++i) {
    // Case 1: 0.5sec
    // asm volatile("" : "+g"(i)); // deoptimize
    // ++*data[0];

    ++*data[i & 1];
}

In summary, the observations are (describing the loop body):

case 1 (fast): ++*pointer[0]

case 2 (medium): ++*pointer[i], with half of the pointers pointing to one int and the other half pointing to the other int.

case 3 (slow): ++*pointer[i], with all pointers pointing to the same int

Here are my current thoughts. Case 1 is fast because the modern CPU knows we are reading/writing the same memory location, and thus buffers the operation, while in Case 2 and Case 3 we need to write the result out in each iteration. The reason Case 3 is slower than Case 2 is that when we write to a memory location through pointer a and then try to read it through pointer b, we have to wait for the write to finish. This stops superscalar execution.

Do I understand it correctly? Is there any way to make Case 3 faster without changing the pointer array? (perhaps adding some CPU hints?)

The question is extracted from the real problem https://github.com/ClickHouse/ClickHouse/pull/7550

Amos
  • Please add the code to the body of the question. – Oblivion Oct 31 '19 at 17:34
  • What information is the link supposed to provide? – Scott Hunter Oct 31 '19 at 17:34
  • case 1 is fast because the compiler optimized the loop away. case 2 is medium probably because the cpu can run the two chains in parallel effectively halving the iteration count. – Jester Oct 31 '19 at 17:57
  • Is this purely theoretical, or is there a real problem that you're trying to solve? If it's a real problem, then having an array of pointers that all point to the same value would be the first problem to solve ;) I would also argue that case 3 is the slowest because the cpu can run nothing in parallel. It's hard to optimize such a case, because each iteration depends on the last one. – Lukas-T Oct 31 '19 at 20:10
  • @Oblivion Done. – Amos Nov 01 '19 at 00:40
  • @Scott Hunter It's to give a minimal reproducible example that explains the three cases. – Amos Nov 01 '19 at 00:40
  • @Jester Case 1 has a special compiler hint to deoptimize, and yup, that's superscalar. – Amos Nov 01 '19 at 00:41
  • @churill Here is the real problem: https://github.com/ClickHouse/ClickHouse/pull/7550 . The reason we might have these kinds of pointer arrays is that we're trying to use vectorized execution instead of JIT. – Amos Nov 01 '19 at 00:41
  • Even with your hint, the loop is indeed not optimized away but the write to memory is - the compiler (at least the one I tried on godbolt) just keeps the value in a register. – Jester Nov 01 '19 at 00:45
  • Exactly, that's the "buffering the operation" I mentioned: using a register to buffer all `++` operations and writing to memory once. Hmm, but that seems to require the help of the compiler, though. – Amos Nov 01 '19 at 00:47

1 Answer


You've discovered one of the effects that causes bottlenecks in histograms. A workaround for that problem is to keep multiple arrays of counters and rotate through them, so repeated runs of the same index are distributed over 2 or 4 different counters in memory.

(Then loop over the arrays of counters to sum them down into one final set of counts. This part can benefit from SIMD.)
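A minimal sketch of that workaround, assuming a hypothetical input array `index[]` of length `n` (neither appears in the question):

    // Sketch: keep 4 copies of the counters so that repeated hits on the
    // same bucket rotate over 4 independent memory locations.
    constexpr size_t NUM_BUCKETS = 256;     // illustrative size
    int counts[4][NUM_BUCKETS] = {};

    for (size_t i = 0; i < n; ++i)
        ++counts[i % 4][index[i]];          // i % 4 selects one of the copies

    // Final pass: sum the copies down into one histogram.
    // This reduction is a good candidate for SIMD.
    for (size_t b = 0; b < NUM_BUCKETS; ++b)
        counts[0][b] += counts[1][b] + counts[2][b] + counts[3][b];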


Case 1 is fast because the modern CPU knows we are reading/writing the same memory location, and thus buffers the operation

No, it's not the CPU, it's a compile-time optimization.

++*pointer[0] is fast because the compiler can hoist the store/reload out of the loop and actually just increment a register. (If you don't use the result, it might optimize away even that.)

Assumption of no data-race UB lets the compiler assume that nothing else is modifying pointer[0] so it's definitely the same object being incremented every time. And the as-if rule lets it keep *pointer[0] in a register instead of actually doing a memory-destination increment.

So that means 1 cycle latency for the increment, and of course it can combine multiple increments into one and do *pointer[0] += n if it fully unrolls and optimizes away the loop.
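Conceptually, the optimizer is allowed to transform Case 1 like this under the as-if rule (a sketch, not actual compiler output):

    // What Case 1 effectively becomes:
    int *p = data[0];                  // nothing in the loop can change data[0]
    int tmp = *p;                      // load the value once
    for (auto i = 0ul; i < 1000000000; ++i)
        ++tmp;                         // 1-cycle register increment
    *p = tmp;                          // store once, after the loop
    // ...and a further pass can collapse the loop into tmp += 1000000000.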


when we write to a memory location through pointer a and then try to read it through pointer b, we have to wait for the write to finish. This stops superscalar execution.

Yes, the data dependency through that memory location is the problem. Without knowing at compile time that the pointers all point to the same place, the compiler will make asm that does actually increment the pointed-to memory location.

"wait for the write to finish" isn't strictly accurate, though. The CPU has a store buffer to decouple store execution from cache misses, and out-of-order speculative exec from stores actually committing to L1d and being visible to other cores. A reload of recently-stored data doesn't have to wait for it to commit to cache; store forwarding from the store-buffer to a reload is a thing once the CPU detects it.

On modern Intel CPUs, store-forwarding latency is about 5 cycles, so a memory-destination add has 6-cycle latency. (1 for the add, 5 for the store/reload if it's on the critical path.)

And yes, out-of-order execution lets two of these 6-cycle-latency dependency chains run in parallel. And the loop overhead is hidden under that latency, again by OoO exec.
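To picture the two chains, unroll Case 2 by two iterations (a sketch; in Case 2, data[0] and data[1] point to different ints):

    // Case 2 unrolled by 2: each increment hits a different object, so the
    // two ~6-cycle store/reload dependency chains overlap in the pipeline.
    for (auto i = 0ul; i < 1000000000; i += 2) {
        ++*data[0];   // chain through one int
        ++*data[1];   // independent chain through the other int
    }
    // Net effect: ~2 increments per 6 cycles instead of 1, roughly 2x Case 3.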



Is there any way to make Case 3 faster without changing the pointer array?

Yes, if that case is expected, maybe branch on it:

    int *current_pointer = pointer[0];
    int repeats = 1;
    ...

    for (size_t i = 1; i < n; ++i) {        // n = number of pointers
        if (pointer[i] == current_pointer) {
            repeats++;                      // same object: extend the run
        } else {
            *current_pointer += repeats;    // flush the finished run
            current_pointer = pointer[i];
            repeats = 1;
        }
    }
    *current_pointer += repeats;            // flush the final run

We optimize by counting the run length of consecutive repeats of the same pointer, so each run costs only a single memory update.

This is totally defeated by Case 2 and will perform poorly if long runs are not common.

Short runs can be hidden by out-of-order exec; only when the dep chain becomes long enough to fill the ROB (reorder buffer) do we actually stall.

Peter Cordes
  • Branching on it is the only approach I tried; it works locally but fails in general. See the pull-request link in my question's comments for the real problem explanation. I'll attach it here too: https://github.com/ClickHouse/ClickHouse/pull/7550/files – Amos Nov 01 '19 at 00:58
  • @Amos: Like I said in my answer, it only works for the very special case of Case 3. I only tacked it on at the end because it answers the specific question asked, not the general case. Like I said at the top of the answer, histograms can replicate the objects to be incremented (an array of counts) and unroll their loop. Other than that, I'm not aware of any good solution for this type of problem unless it's worth maybe (partially) sorting the input array. You're at the mercy of out-of-order execution to overlap latencies for different objects. – Peter Cordes Nov 01 '19 at 01:05
  • "only when the dep chain becomes long enough to fill the ROB (reorder buffer) do we actually stall. " How long would that be? Maybe we can have an estimator on the run length and encode it in a way that only long runs are treated differently? – Amos Nov 01 '19 at 02:11
  • @Amos: Yup, that's correct. Modern x86 CPUs all have large out-of-order execution resources to find instruction-level parallelism across short blocks of serial dependencies. See [Are loads and stores the only instructions that gets reordered?](//stackoverflow.com/a/50496379), and [Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths](//stackoverflow.com/q/51986046), and a [more basic intro answer](https://softwareengineering.stackexchange.com/questions/349972/how-does-a-single-thread-run-on-multiple-cores/350024#350024) – Peter Cordes Nov 01 '19 at 02:14
  • @Amos: yes, if you can flag long runs specially, short OoO exec can eat the short runs. Of course the non-special cases will never be as fast as Case 1 when each increment actually has to update *some* memory location. You could construct a separate test case that distributes pointers over 32 different objects or something to see the best case for memory increments. – Peter Cordes Nov 01 '19 at 02:16
  • "the best case for memory increments" What do you mean by the best case? I'm not sure if distributing pointers could help determine the maxium run length. – Amos Nov 01 '19 at 02:23
  • @Amos: I mean the best non-special case where you just run `++*pointer[i]` normally. Perfect handling of long runs (once you find them) can be as fast as Case 1, but other cases are non-special. That will do well if pointers don't repeat frequently, so OoO exec sees lots of separate dependency chains, and memory disambiguation can predict that the load part of `add dword [rdi], 1` is independent of previous stores. Or it's fine if it forwards from a store that hasn't committed yet, as long as there was enough work in between that it didn't have to stall. – Peter Cordes Nov 01 '19 at 02:29
  • @Amos: Of course, "long runs" are a problem even if they're interleaved with a few other things, like Case 2, or like `a a a b a a a c a a a d a a a e ...`. That's nearly as bad as Case 3 (bottlenecked on incrementing `a`) but doesn't have any long runs of *just* `a`. – Peter Cordes Nov 01 '19 at 02:31
  • Yup. It seems very tricky to build a reliable estimator on that. – Amos Nov 01 '19 at 02:37
  • @Amos: Yes, this is a hard problem. I suggested the possibility of a single-pass partial sort or something, that might help. Or of course a smart-enough JIT could do the combining work once and make it fast for subsequent executions. If it's literally just independent increments, maybe you can histogram once and then apply that repeatedly, if you're reusing the same sequence multiple times. In general, avoid creating this problem in the first place, if you can. – Peter Cordes Nov 01 '19 at 02:41