
This is a follow-up to the question How can I resolve data dependency in pointer arrays?

I'll quote the description from that question first:

If we have an array of integer pointers that all point to the same int, and we loop over it applying the ++ operation, it is 100% slower than when the pointers point to two different ints.

Here is a new version of the example code:

#include <vector>

using std::vector;

// Make sure each element spans at least two cache lines (512 bytes
// here), so distinct elements never share a cache line
struct Cacheline {
    int c[128]{};
};

int main() {
    Cacheline d[4];
    vector<int*> f;
    f.resize(100000000);

    // case 1 : counting over the same location
    {
        for (auto i = 0ul; i < f.size(); ++i) {
            f[i] = d[i % 1].c;  // every pointer targets d[0].c
        }
        /// this takes 200ms
        for (auto i = 0ul; i < f.size(); ++i) {
            ++*f[i];
        }
    }

    {
        // case 2 : two locations interleaved
        for (auto i = 0ul; i < f.size(); ++i) {
            f[i] = d[i % 2].c;  // pointers alternate between d[0].c and d[1].c
        }
        /// this takes 100ms
        for (auto i = 0ul; i < f.size(); ++i) {
            ++*f[i];
        }
    }
    // ...
    // three locations takes 90ms and four locations takes 85ms
}
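
The 200ms/100ms figures are presumably wall-clock timings of each increment loop; a minimal harness along these lines can reproduce the comparison (the helper name and the choice of std::chrono::steady_clock are mine, since the original measurement code isn't shown):

#include <chrono>
#include <cstdio>
#include <vector>

// Hypothetical helper: time one increment pass over the pointer array,
// in milliseconds, using a monotonic wall clock.
long long time_increment_pass(std::vector<int*>& f) {
    auto t0 = std::chrono::steady_clock::now();
    for (auto i = 0ul; i < f.size(); ++i) {
        ++*f[i];
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
}

// usage: std::printf("%lld ms\n", time_increment_pass(f));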

I understand that the performance gain in case 2 comes from out-of-order execution kicking in to hide the latency of the data dependency. I'm looking for a general way to optimize for this by exploiting OoO execution. The method should have negligible pre-processing cost, since my use case involves dynamic workloads.

  • In general how? Generalizing in what way? If you want to be totally general, avoid loop-carried dependency chains when possible, and if you can't then keep them short (e.g. by avoiding store/reload) and/or use multiple accumulators (registers or memory locations) to have multiple dep chains in flight in parallel. – Peter Cordes Mar 04 '20 at 03:33 [see the multiple-accumulator sketch after these comments]
  • In general as in "no presumptions on either the data distribution or the possibility to avoid loop-carried dependency chains". Could you clarify what you mean by multiple accumulators? I don't see the benefits if all dep chains are still dealing with the same memory location. – Amos Mar 04 '20 at 04:17
  • I was talking about reductions like a dot product or sum of an array where multiple SIMD vector accumulators hide FP (or integer) latency. In your case, more like a histogram problem, your accumulators are the counters in memory. The more you distribute your increments over, the better. (Although you plateau at maybe 6 or 8 on current x86 CPUs, and if you had so many that you get cache misses that's of course worse.) – Peter Cordes Mar 04 '20 at 04:30
  • @PeterCordes That's actually a `GroupBy` problem with a `count` aggregator, the key to be grouped by could have patterns like `aaaaaaaaaa` or `abcdefg`. I can see how sub accumulators could help with the histogram problem but for my case, it's not viable because the key cardinality might be huge, or the pattern might be like `aaaaabaaaaacaaaaad...`. – Amos Mar 04 '20 at 04:45
  • @yzt hmm, I don't think the compiler has seen through it: https://godbolt.org/z/n-t_eJ – Amos Mar 04 '20 at 04:46
  • @Amos Yeah. I should have godbolted it before stating my suspicions! – yzt Mar 04 '20 at 04:53
  • For an actual histogram (e.g. counting ASCII character frequency), you'd make e.g. 4 arrays of 256 counters each, and unroll to distribute counts over them. (Then sum at the end.) Works great when there aren't many total bins. But if you have pointers, you'd need to introduce a hash function to use multiple tables, and that would probably be slower than just eating the store-forwarding latency in any realistic case, and maybe even for your "case 1" where it's all one value. Computing a hash function, and checking for hash table hit before incrementing, might cost more. – Peter Cordes Mar 04 '20 at 04:56 [see the replicated-histogram sketch after these comments]
  • @yzt: if the compiler had optimized the counter `*f[i]` into a register, we'd see a factor of 6 difference on current Intel CPUs (~6 cycle latency for memory-destination add, including the store/reload). Or a larger factor if it really optimized it away. A factor of 2x is pretty clearly from doing 2 independent dep chains in parallel. (At least to me; I had already answered the linked question :P) – Peter Cordes Mar 04 '20 at 05:32
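
To make the multiple-accumulator idea from the comments concrete, here is a minimal sketch for a plain sum reduction (my illustration, not code from the comments): splitting the work over four independent accumulators gives the CPU four short dependency chains to overlap instead of one long one.

#include <cstddef>
#include <vector>

// Sum with four independent accumulators: each addition depends only on
// its own chain's previous value, so out-of-order execution can keep
// all four chains in flight at once.
long long sum4(const std::vector<int>& v) {
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < v.size(); ++i) s0 += v[i];  // leftover tail
    return (s0 + s1) + (s2 + s3);          // combine chains once, at the end
}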
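
The replicated-histogram variant from the last comment looks like this in sketch form (again my illustration, assuming 256 byte-valued bins and four sub-tables): consecutive inputs update different tables, so even a run of identical bytes spreads its increments over four independent memory locations, and the tables are summed once at the end.

#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Four sub-tables of 256 counters each. Unrolling by four sends
// consecutive bytes to different tables, turning one serial chain of
// increments into four parallel ones.
std::array<std::uint64_t, 256> byte_histogram(const std::vector<std::uint8_t>& data) {
    std::uint64_t t[4][256] = {};
    std::size_t i = 0;
    for (; i + 4 <= data.size(); i += 4) {
        ++t[0][data[i]];
        ++t[1][data[i + 1]];
        ++t[2][data[i + 2]];
        ++t[3][data[i + 3]];
    }
    for (; i < data.size(); ++i) ++t[0][data[i]];  // leftover tail
    std::array<std::uint64_t, 256> counts{};
    for (int b = 0; b < 256; ++b)
        counts[b] = t[0][b] + t[1][b] + t[2][b] + t[3][b];
    return counts;
}

As the comment notes, this only pays off when the number of bins is small; with arbitrary pointer keys you would first need a hash function and multiple tables, which would likely cost more than the store-forwarding latency it hides.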

0 Answers