4

Consider this function in C++:

void foo(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, uint32_t *o) {
    while (b1 != b2) {
        // assert(0 <= *b1 && *b1 < a2 - a1)
        *o++ = a1[*b1++];
    }
}

Its purpose should be clear enough. Unfortunately, the contents of b1 are random and trash the cache, making foo the bottleneck of my program. Is there any way I can optimize it?

This is an SSCCE that should resemble my actual code:

#include <iostream>
#include <chrono>
#include <cstdint>
#include <algorithm>
#include <numeric>

namespace {
    void foo(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, uint32_t *o) {
        while (b1 != b2) {
            // assert(0 <= *b1 && *b1 < a2 - a1)
            *o++ = a1[*b1++];
        }
    }

    constexpr unsigned max_n = 1 << 24, max_q = 1 << 24;
    uint32_t data[max_n], index[max_q], result[max_q];
}

int main() {
    uint32_t seed = 0;
    auto rng = [&seed]() { return seed = seed * 9301 + 49297; };
    std::generate_n(data, max_n, rng);
    std::generate_n(index, max_q, [rng]() { return rng() % max_n; });

    auto t1 = std::chrono::high_resolution_clock::now();
    foo(data, data + max_n, index, index + max_q, result);
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration<double>(t2 - t1).count() << std::endl;

    uint32_t hash = 0;
    for (unsigned i = 0; i < max_q; i++)
        hash += result[i] ^ (i << 8) ^ i;
    std::cout << hash << std::endl;
}

This is not "Cache-friendly copying of an array with readjustment by known index, gather, scatter", which asks about random writes and assumes b is a permutation.

  • I assume that making index std::map is not realistic from the point of how index is created in real code? – NoSenseEtAl Sep 05 '17 at 17:08
  • In a quick test, I could *slightly* speed it up by applying some prefetching (while the pattern is random, it's easy to look ahead), but only barely enough to be measurable as a legitimate difference. Perhaps someone who knows more about such memory optimizations can get more out of it. – harold Sep 05 '17 at 17:17
  • @NoSenseEtAl How would an `std::map` benefit? – johnchen902 Sep 05 '17 at 17:24
  • @johnchen902 IDK if it would, but you know it is sorted, so it is less random access when you use it as indexes. – NoSenseEtAl Sep 05 '17 at 17:27
  • @NoSenseEtAl But `std::map` is sorted by key, not by value. – johnchen902 Sep 05 '17 at 17:34
  • @harold - yeah, prefetching probably doesn't help much here because the index for each iteration doesn't depend on the prior read, so even a normal OoO chip can get close to full MLP here. Prefetching can still help a bit sometimes, specifically triggering page-walks a bit ahead of time. – BeeOnRope Sep 06 '17 at 01:36
  • @johnchen902 - do you have a specific hardware platform in mind? Can you use languages other than C/C++? – BeeOnRope Sep 06 '17 at 22:00
  • @BeeOnRope (1) Let's say it's an x86_64 with extensions and cache size of your choice. (My laptop is i7-4712MQ. L1d cache 32K, L2 256K, L3 6144K.) (2) Yes – johnchen902 Sep 07 '17 at 04:19

2 Answers

5

First, let's take a look at the actual performance of the code above:

$ sudo perf stat ./offline-read
0.123023
1451229184

 Performance counter stats for './offline-read':

        184.661547      task-clock (msec)         #    0.997 CPUs utilized          
                 3      context-switches          #    0.016 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               717      page-faults               #    0.004 M/sec                  
       623,638,834      cycles                    #    3.377 GHz                    
       419,309,952      instructions              #    0.67  insn per cycle         
        70,803,672      branches                  #  383.424 M/sec                  
            16,895      branch-misses             #    0.02% of all branches        

       0.185129552 seconds time elapsed

We are getting a low IPC of 0.67, probably caused almost entirely by load-misses to DRAM⁵. Let's confirm:

sudo ../pmu-tools/ocperf.py stat -e cycles,LLC-load-misses,cycle_activity.stalls_l3_miss ./offline-read
perf stat -e cycles,LLC-load-misses,cpu/event=0xa3,umask=0x6,cmask=6,name=cycle_activity_stalls_l3_miss/ ./offline-read
0.123979
1451229184

 Performance counter stats for './offline-read':

       622,661,371      cycles                                                      
        16,114,063      LLC-load-misses                                             
       368,395,404      cycle_activity_stalls_l3_miss                                   

       0.184045411 seconds time elapsed

So ~370 million cycles out of ~620 million are straight-up stalled on outstanding misses. In fact, the portion of cycles stalled this way in foo() is much higher, close to 90%, since perf is also measuring the init and accumulate code which takes about a third of the runtime (but doesn't have significant L3 misses).

This is nothing unexpected, since we knew the random-read pattern a1[*b1++] was going to have essentially zero locality. In fact, the number of LLC-load-misses is 16 million¹, corresponding almost exactly to the 16 million random reads of a1.²

If we just assume 100% of foo() is spent waiting on memory access, we can get an idea of the total cost of each miss: 0.123 sec / 16,114,063 misses == 7.63 ns/miss. On my box, the memory latency is around 60 ns in the best case, so less than 8 ns per miss means we are already extracting a lot of memory-level parallelism (MLP): about 8 misses would have to be overlapped and in-flight on average to achieve this (even totally ignoring the additional traffic from the streaming load of b1 and streaming write of o).

So I don't think there are many tweaks you can apply to the simple loop to do much better. Still, two possibilities are:

  • Non-temporal stores for the writes to o, if your platform supports them. This would cut out the reads implied by RFO for normal stores. It should be a straight win since o is never read again (inside the timed portion!). See the sketch after this list.
  • Software prefetching. Carefully tuned prefetching of a1 or b1 could potentially help a bit. The impact is going to be fairly limited, however, since we are already approaching the limits of MLP as described above. Also, we expect the linear reads of b1 to be almost perfectly prefetched by the hardware prefetchers. The random reads of a1 seem like they could be amenable to prefetching, but in practice the ILP in the loop leads to enough MLP through out-of-order processing (at least on big OoO processors like recent x86).

    In the comments user harold already mentioned that he tried prefetching with only a small effect.
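
As a rough sketch of the non-temporal-store idea (this is not from the question's code; it assumes x86-64 with SSE2, which provides the MOVNTI intrinsic _mm_stream_si32, and the name foo_nt is just illustrative):

#include <cstdint>
#include <immintrin.h> // x86 intrinsics: _mm_stream_si32 (MOVNTI), _mm_sfence

// Variant of foo() that writes o with non-temporal stores, avoiding the
// read-for-ownership traffic on o. Assumes o is 4-byte aligned and is not
// read again soon, so bypassing the cache for the stores is acceptable.
void foo_nt(const uint32_t *a1, const uint32_t *a2,
            const uint32_t *b1, const uint32_t *b2, uint32_t *o) {
    (void)a2; // bounds are only asserted, as in the original foo()
    while (b1 != b2) {
        _mm_stream_si32(reinterpret_cast<int *>(o++),
                        static_cast<int>(a1[*b1++]));
    }
    _mm_sfence(); // order the streaming stores before any later reads of o
}

Since o is written sequentially, consecutive MOVNTI stores should combine into full-line writes in the write-combining buffers.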

So since the simple tweaks aren't likely to bear much fruit, you are left with transforming the loop. One "obvious" transformation is to sort the indexes b1 (along with the index element's original position) and then do the reads from a1 in sorted order. This transforms the reads of a1 from completely random to almost³ linear, but now the writes are all random, which is no better.

Sort and then unsort

The key problem is that the reads of a1 under control of b1 are random, and a1 is large, so you get a miss-to-DRAM for essentially every read. We can fix that by sorting b1, and then reading a1 in order to get a permuted result. Now you need to "un-permute" that result to get it in the final order, which is simply another sort, this time on the "output index".

Here's a worked example with the given input array a, index array b and output array o, and i which is the (implicit) position of each element:

  i =   0   1   2   3
  a = [00, 10, 20, 30]
  b = [ 3,  1,  0,  1]
  o = [30, 10, 00, 10] (desired result)

First, sort array b, with the original array position i as secondary data (alternatively, you may see this as sorting tuples (b[0], 0), (b[1], 1), ...). This gives you the sorted b array b' and the sorted index list i' as shown:

  i' = [ 2, 1, 3, 0]
  b' = [ 0, 1, 1, 3]

Now you can read the permuted result array o' from a under the control of b'. This read is strictly increasing in order, and should be able to operate at close to memcpy speeds. In fact you may be able to take advantage of wide contiguous SIMD reads and some shuffles to do several reads at once and move the 4-byte elements into the right place (duplicating some elements and skipping others):

  a  = [00, 10, 20, 30]
  b' = [ 0,  1,  1,  3]
  o' = [00, 10, 10, 30]

Finally, you de-permute o' to get o, conceptually simply by sorting o' on the permuted indexes i':

  i' = [ 2,  1,  3,  0]
  o' = [00, 10, 10, 30]
  i  = [ 0,  1,  2,  3]
  o  = [30, 10, 00, 10]

Finished!
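
For concreteness, here is a minimal, unoptimized sketch of those three steps (it uses std::sort and a temporary buffer purely for clarity; a real implementation would use the cache-aware radix sorts discussed below, and foo_sorted is just an illustrative name):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Sketch of sort -> linear gather -> un-permute for the loop in the question.
void foo_sorted(const uint32_t *a1, const uint32_t *a2,
                const uint32_t *b1, const uint32_t *b2, uint32_t *o) {
    (void)a2; // bounds are only asserted, as in the original foo()
    const std::size_t q = static_cast<std::size_t>(b2 - b1);

    // Step 1: sort (b[i], i) pairs by index, giving b' (first) and i' (second).
    std::vector<std::pair<uint32_t, uint32_t>> bi(q);
    for (std::size_t i = 0; i < q; i++)
        bi[i] = { b1[i], static_cast<uint32_t>(i) };
    std::sort(bi.begin(), bi.end());

    // Step 2: read a1 in (almost) linear order to produce the permuted o'.
    std::vector<uint32_t> o_perm(q);
    for (std::size_t k = 0; k < q; k++)
        o_perm[k] = a1[bi[k].first];

    // Step 3: "sort o' on the output index i'", i.e. scatter back into o.
    for (std::size_t k = 0; k < q; k++)
        o[bi[k].second] = o_perm[k];
}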

Now this is the simplest idea of the technique and isn't particularly cache-friendly (each pass conceptually iterates over one or more 2^26-byte arrays), but it at least fully uses every cache line it reads (unlike the original loop which only reads a single element from a cache line, which is why you have 16 million misses even though the data only occupies 1 million cache lines!). All of the reads are more or less linear, so hardware prefetching will help a lot.

How much speedup you get probably depends largely on how you implement the sorts: they need to be fast and cache-sensitive. Almost certainly some type of cache-aware radix sort will work best.

Here are some notes on ways to improve this further:

Optimize the amount of sorting

You don't actually need to fully sort b. You just want to sort it "enough" such that the subsequent reads of a under the control of b' are more or less linear. For example, 16 elements fit in a cache line, so you don't need to sort based on the last 4 bits at all: the same linear sequence of cache lines will be read anyway. You could also sort on even fewer bits: e.g., if you ignored the 5 least-significant bits, you'd read cache lines in an "almost linear" way, sometimes swapping two cache lines from the perfectly linear pattern, like: 0, 1, 3, 2, 5, 4, 6, 7. Here, you'll still get the full benefit of the L1 cache (subsequent reads to a cache line will always hit), and I suspect such a pattern would still be prefetched well; if not, you can always help it with software prefetching.
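
In the std::sort-based sketch above, for example, that amounts to comparing only the high bits (the real win is fewer radix passes, but the comparator shows the idea; kIgnoreBits is illustrative):

#include <cstdint>
#include <utility>

// Order (index, position) pairs only by the high bits of the index: elements
// whose indices fall in the same 2^kIgnoreBits-sized window keep an arbitrary
// relative order, which is all the (almost) linear read of a needs.
constexpr unsigned kIgnoreBits = 5;

inline bool coarse_less(const std::pair<uint32_t, uint32_t> &x,
                        const std::pair<uint32_t, uint32_t> &y) {
    return (x.first >> kIgnoreBits) < (y.first >> kIgnoreBits);
}
// usage: std::sort(bi.begin(), bi.end(), coarse_less);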

You can test on your system what the optimal number of ignored bits is. Ignoring bits has two benefits:

  • Less work to do in the radix sort, either from fewer passes needed or needing fewer buckets in one or more passes (which helps caching).
  • Potentially less work to do to "undo" the permutation in the last step: if the undo is done by examining the original index array b, ignoring bits means that you get the same savings when undoing the sort.

Cache block the work

The above description lays out everything in several sequential, disjoint passes that each work on the entire data set. In practice, you'd probably want to interleave them to get better caching behavior. For example, assuming you use an MSD radix-256 sort, you might do the first pass, sorting the data into 256 buckets of approximately 64K elements each.

Then rather than doing the full second pass, you might finish sorting only the first (or first few) buckets, and proceed to do the read of a based on the resulting block of b'. You are guaranteed that this block is contiguous (i.e., a prefix of the final sorted sequence) so you don't give up any locality in the read, and your reads will generally be cached. You may also do the first pass of de-permuting o', since the block of o' is also hot in the cache (and perhaps you can combine the latter two phases into one loop).
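
Here is a rough sketch of that blocking, assuming 24-bit indices as in the SSCCE (a radix-256 first pass; each bucket is then finished with std::sort for brevity, gathered from a, and pushed through the first de-permutation pass; a final pass writes o one block at a time; foo_blocked and the helper names are illustrative, not a tuned implementation):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Assumes the indices in b are 24-bit (0 <= b[i] < 2^24), as in the SSCCE.
void foo_blocked(const uint32_t *a, const uint32_t *b, std::size_t q, uint32_t *o) {
    constexpr int kBuckets = 256;
    const unsigned in_shift = 24 - 8;                       // top 8 bits of the index
    const std::size_t out_size = (q + kBuckets - 1) / kBuckets; // positions per output block
    using Pair = std::pair<uint32_t, uint32_t>;

    // Radix pass 1: group (index, original position) pairs by the top index bits.
    std::vector<std::size_t> in_start(kBuckets + 1, 0);
    for (std::size_t i = 0; i < q; i++)
        in_start[(b[i] >> in_shift) + 1]++;                 // histogram
    for (int k = 1; k <= kBuckets; k++)
        in_start[k] += in_start[k - 1];                     // prefix sum -> bucket starts
    std::vector<Pair> in_buckets(q);
    {
        std::vector<std::size_t> fill(in_start.begin(), in_start.end() - 1);
        for (std::size_t i = 0; i < q; i++)
            in_buckets[fill[b[i] >> in_shift]++] = { b[i], static_cast<uint32_t>(i) };
    }

    // Output buckets group (value, original position) by the position's block,
    // so the final writes to o are confined to one small slice of o at a time.
    std::vector<Pair> out_buckets(q);
    std::vector<std::size_t> out_fill(kBuckets);
    for (int m = 0; m < kBuckets; m++)
        out_fill[m] = std::size_t(m) * out_size;

    for (int k = 0; k < kBuckets; k++) {
        auto first = in_buckets.begin() + in_start[k];
        auto last  = in_buckets.begin() + in_start[k + 1];
        std::sort(first, last);                             // finish sorting this bucket
        for (auto it = first; it != last; ++it) {
            uint32_t value = a[it->first];                  // near-linear reads of a
            out_buckets[out_fill[it->second / out_size]++] = { value, it->second };
        }
    }

    for (std::size_t j = 0; j < q; j++)                     // localized de-permute into o
        o[out_buckets[j].second] = out_buckets[j].first;
}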

Smart De-permutation

One area for optimization is how exactly the de-permutation of o' is implemented. In the description above, we assume some index array i initially with values [0, 1, 2, ..., max_q] which is sorted along with b. That's conceptually how it works, but you may not need to actually materialize i right away and sort it as auxiliary data. In the first pass of the radix sort, for example, the value of i is implicitly known (since you are iterating through the data), so it could be calculated for free⁴ and written out during the first pass without ever having appeared in sorted order.

There may also be more efficient ways to do the "unsort" operation than maintaining the full index. For example, the original unsorted b array conceptually has all the information needed to do the unsort, but it is not clear to me how to use it to unsort efficiently.

Will it be faster?

So will this actually be faster than the naive approach? It depends largely on implementation details, especially the efficiency of the implemented sort. On my hardware, the naive approach is processing about 140 million elements per second. Online descriptions of cache-aware radix sorts seem to vary from perhaps 200 to 600 million elements/s, and since you need two of those, the opportunity for a big speedup would seem limited if you believe those numbers. On the other hand, those numbers are from older hardware, and for slightly more general sorts (e.g., on all 32 bits of the key, while we may be able to use as few as 16 bits).

Only a careful implementation will determine if it is feasible, and feasibility also depends on the hardware. For example, on hardware that can't sustain as much MLP, the sorting-unsorting approach becomes relatively more favorable.

The best approach also depends on the relative values of max_n and max_q. For example, if max_n >> max_q, then the reads will be "sparse" even with optimal sorting, so the naive approach would be better. On the other hand if max_n << max_q, then the same index will usually be read many times, so the sorting approach will have good read locality, the sorting steps will themselves have better locality, and further optimizations which handle duplicate reads explicitly may be possible.

Multiple Cores

It isn't clear from the question whether you are interested in parallelizing this. The naive solution for foo() already admits a "straightforward" parallelization where you simply partition the b array (and the corresponding range of o) into equal-sized chunks, one for each thread, which would seem to provide a perfect speedup. Unfortunately, you'll probably find that you get much worse than linear scaling, because you'll be running into resource contention in the memory controller and associated uncore/offcore resources which are shared between all cores on a socket. So it isn't clear how much more throughput you'll get for a purely parallel random read load to memory as you add more cores⁶.
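
A minimal sketch of that chunked parallelization, using std::thread (the thread count, chunking, and the name foo_parallel are illustrative; compile with -pthread on gcc/clang):

#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Straightforward parallelization of the naive gather: b (and the matching
// range of o) is split into one chunk per thread; every thread still reads
// from the whole of a1, so only the index/output arrays are partitioned.
void foo_parallel(const uint32_t *a1, const uint32_t *a2,
                  const uint32_t *b1, const uint32_t *b2, uint32_t *o,
                  unsigned threads = 4) {
    (void)a2; // bounds are only asserted, as in foo()
    const std::size_t q = static_cast<std::size_t>(b2 - b1);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; t++) {
        const std::size_t lo = q * t / threads;
        const std::size_t hi = q * (t + 1) / threads;
        pool.emplace_back([=] {
            for (std::size_t i = lo; i < hi; i++)
                o[i] = a1[b1[i]];        // same random gather, on a sub-range
        });
    }
    for (auto &th : pool)
        th.join();
}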

For the radix-sort version, most of the bottlenecks (store throughput, total instruction throughput) are in the core, so I expect it to scale reasonably with additional cores. As Peter mentioned in the comment, if you are using hyperthreading, the sort may have the additional benefit of good locality in the core local L1 and L2 caches, effectively letting each sibling thread use the entire cache, rather than cutting the effective capacity in half. Of course, that involves carefully managing your thread affinity so that sibling threads actually use nearby data, and not just letting the scheduler do whatever it does.


1 You might ask why the LLC-load-misses isn't say 32 or 48 million, given that we also have to read all 16 million elements of b1 and then the accumulate() call reads all of result. The answer is that LLC-load-misses only counts demand misses that actually miss in the L3. The other mentioned read patterns are totally linear, so the prefetchers will always be bringing the line into the L3 before it is needed. These don't count as "LLC misses" by the definition perf uses.

2 You might want to know how I know that the load misses all come from the reads of a1 in foo: I simply used perf record and perf mem to confirm that the misses were coming from the expected assembly instruction.

3 Almost linear because b1 is not a permutation of all indexes, so in principle there can be skipped and duplicated indexes. At the cache-line level, however, it is highly likely that every cache line will be read in order, since each element has a ~63% chance of being included, and a cache line has 16 4-byte elements, so there's only about a 1 in 10 million chance that any given cache line has zero elements. So prefetching, which works at the cache-line level, will work fine.

4 Here I mean that the calculation of the value comes for free or nearly so, but of course the write still costs. This is still much better than the "up-front materialization" approach, however, which first creates the i array [0, 1, 2, ...], needing max_q writes, and then needs another max_q writes to sort it in the first radix sort pass. The implicit materialization only incurs the second write.

5 In fact, the IPC of the actual timed section foo() is much lower: about 0.15 based on my calculations. The reported IPC of the entire process is an average of the IPC of the timed section and the initialization and accumulation code before and after which has a much higher IPC.

6 Notably, this is different from how a dependent-load, latency-bound workload scales: a load stream that is doing random reads but can only have one load in progress at a time, because each load depends on the result of the last, scales very well to multiple cores because the serial nature of the loads doesn't use many downstream resources (but such loads can conceptually also be sped up even on a single core by changing the core loop to handle more than one dependent load stream in parallel).

BeeOnRope
  • Sorting for locality also opens up opportunities for useful multi-threading to take advantage of the private L2/L1 caches in multiple cores. (And of course for doing the sort in parallel). Also for multiple streams of sequential reads to use more total memory bandwidth on CPUs where a single core can't max out sequential-read bandwidth. Future readers, see https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy: latency bound platforms. – Peter Cordes Sep 08 '17 at 20:56
  • For the specific case the OP mentions, (`hash += result[i] ^ (i << 8) ^ i;`), you can optimize away putting `o[]` into the correct order. `uint32_t` addition is associative, so you can do it on the fly as you get `o[i]` in whatever order is convenient. Specifically, a SIMD loop over `o'` and `i'` gives you the data you need to `VPSLLD` by 8 and `VPXOR`, then `VPADDD` into a vector accumulator, with AVX2 on x86. – Peter Cordes Sep 08 '17 at 21:07
  • @peter - huh, the OP originally had a `std::accumulate` call to verify the result but I think it has since been changed. In any case, I think the idea is only to optimize the `foo` function: that's what's being timed and the crux of the problem underlying the SSCCE. AFAIK the hash is there to validate the result (and prevent compiler optimizations), but is not really interesting in and of itself. – BeeOnRope Sep 08 '17 at 21:14
  • Yeah, clearly the putting-it-back-in-order part of the answer is useful, but in general looking for ways to *avoid* doing that in any particular use case is a very good idea. (As this use-case demonstrates). Or at least to delay it until you can combine it with a slow computation, since that might be better (or worse if memory latency + ALU latency is more than OOO execution can hide). – Peter Cordes Sep 08 '17 at 21:16
  • I assumed the hash wasn't what the OP really wanted, too, but it's good that they updated to something order-dependent. Enough to stop compiler optimization, but not enough to stop smart humans who post SO answers from optimizing somewhat :P – Peter Cordes Sep 08 '17 at 21:18
  • @PeterCordes sure, there certainly may be all sorts of variations from the SSCCE in the actual underlying problem that admit various optimizations, but where do you draw the line? I'm quite sure the indexes in the real problem aren't generated via a random number generator, and we probably don't have the real values for `max_p` and `max_q` either. The way the question is posed though, is fairly straightforward: make `foo()` fast, and it's an interesting problem in its own right - and it could very well be real world (e.g., the result `o` may be directly output as-is or something). – BeeOnRope Sep 08 '17 at 21:32
  • Yeah, the specific optimization I mentioned is just a footnote to the main answer. Totally agree that the real answer should be just about `foo()`. – Peter Cordes Sep 08 '17 at 21:33
  • @PeterCordes - you make a good point about the amenability of the sorting version of the problem to adding cores: the radix sort generally seems to be core-bound (e.g., bound on stores, or bound on total uop throughput, or perhaps on a mix of that and various memory/cache latencies and TLB behaviors), while the original problem is pretty much totally latency-to-memory bound. So the radix sort problem should scale well to more cores. – BeeOnRope Sep 08 '17 at 21:35
  • ... that said, the original latency-to-memory-bound version will actually also scale a fair bit to multiple cores because you basically get more LFBs. You don't get more SQ or memory controller buffer entries though, so the scaling will only be partial (actually I'm quite curious how this will pan out and may test it). I guess you will see a bit of scaling with the first extra core or two, because it seems like the SQ entries are a bit oversized for a single core, but then I guess it will plateau at some level. If you have multiple memory controllers things get interesting. – BeeOnRope Sep 08 '17 at 21:35
  • Correction: the SQ actually seems to be a core-private resource, so the SQ entry count itself scales with more cores. Whatever the SQ itself hands off too (memory controller buffers?) probably has some limit, but I don't know how many. – BeeOnRope Sep 08 '17 at 21:51
0

You can partition the indices into buckets where the higher bits of the indices are the same. Beware that if the indices are not random, the buckets can overflow.

#include <iostream>
#include <chrono>
#include <cassert>
#include <cstdint>
#include <algorithm>
#include <numeric>
#include <vector>

namespace {
constexpr unsigned max_n = 1 << 24, max_q = 1 << 24;

void foo(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, uint32_t *o) {
    while (b1 != b2) {
        // assert(0 <= *b1 && *b1 < a2 - a1)
        *o++ = a1[*b1++];
    }
}

uint32_t* foo_fx(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, const uint32_t b_offset, uint32_t *o) {
    while (b1 != b2) {
        // assert(0 <= *b1 && *b1 < a2 - a1)
        *o++ = a1[b_offset+(*b1++)];
    }
    return o;
}

uint32_t data[max_n], index[max_q], result[max_q];
std::pair<uint32_t, uint32_t[max_q / 8]> index_fx[16]; // 16 buckets: {fill count, low 20 bits of each index}; sized 2x the expected load

}

int main() {
    uint32_t seed = 0;
    auto rng = [&seed]() { return seed = seed * 9301 + 49297; };
    std::generate_n(data, max_n, rng);
    //std::generate_n(index, max_q, [rng]() { return rng() % max_n; });
    for (size_t i = 0; i < max_q;++i) {
        const uint32_t idx = rng() % max_n;
        const uint32_t bucket = idx >> 20;
        assert(bucket < 16);
        index_fx[bucket].second[index_fx[bucket].first] = idx % (1 << 20);
        index_fx[bucket].first++;
        assert((1 << 20)*bucket + index_fx[bucket].second[index_fx[bucket].first - 1] == idx);
    }
    auto t1 = std::chrono::high_resolution_clock::now();
    //foo(data, data + max_n, index, index + max_q, result);

    uint32_t* result_begin = result;
    for (int i = 0; i < 16; ++i) {
        result_begin = foo_fx(data, data + max_n, index_fx[i].second, index_fx[i].second + index_fx[i].first, (1<<20)*i, result_begin);
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration<double>(t2 - t1).count() << std::endl;
    std::cout << std::accumulate(result, result + max_q, 0ull) << std::endl;
}
NoSenseEtAl
  • You are doing the partitioning outside the timed region, which isn't really equivalent to the OP's question: you should leave the `std::generate_n(index, ...)` call alone and put any manipulation of `index` into the timed region for it to be equivalent. Otherwise, I could just "sort" the indexes outside the timed region and this approach would show as very fast indeed! – BeeOnRope Sep 06 '17 at 01:42
  • no, (sort is slow, partitioning also I tried), this is just writing to different locations in memory when creating index. – NoSenseEtAl Sep 06 '17 at 06:55
  • Well if you tried good implementations of sorting & partitioning, I don't see much hope, because the only solution I am aware of is to partially sort/partition the data so that the writes have locality of reference. Otherwise you will be bound by memory latency and concurrency for the stores. – BeeOnRope Sep 06 '17 at 07:01
  • Well difference is that sorting and partitioning(you need to do std::partition more than once) does 10 or so writes per index element, my code does 1. Now this does not mean my code is better than original one... why? Because SSCCE is probably not representative of how real index data is being generated so my slower creation of indexes may slow down real code more than the locality of index partitioning helps... tl;dr OP needs to try out my solution in his real code – NoSenseEtAl Sep 06 '17 at 09:16
  • Sorry, I got the characters mixed up - when I said "Well if you tried good implementations of sorting & partitioning, I don't see much hope" - I thought you were the OP! Your answer _is_ a form of partitioning and I think overall that's the viable approach. Of course there are types of partitioning (yours for example) that do less than 10 writes per element. Even a full radix sort with base 2^12 would only do ~2 writes per element for the sort. – BeeOnRope Sep 06 '17 at 17:56
  • Revision 2 summary: *This is an idea what you can do* *On my machine it gives OK speedup* That's implied by the existence of this answer. *hard-coding constants is bad* Those who don't already know this shouldn't do optimization anyway. And the actual information is buried in parentheses (*you can get buffer overflow*) and after BTW (what the code does). – johnchen902 Sep 07 '17 at 05:10
  • Assuming your code is correct, you don't really need the alternative `foo_fx` for this job. `foo(data + (i << 20), data + ((i + 1) << 20), index_fx[i].second, index_fx[i].second + index_fx[i].first, result_begin); result_begin += index_fx[i].first;` will do. (STL doesn't return the end of output unless it cannot be known beforehand.) – johnchen902 Sep 07 '17 at 05:28
  • And unfortunately, the order of output is significant. Your solution failed to respect that. (IMO it's clear that my requirement is to write a function equivalent to `foo`) Maybe I shouldn't use such a simple `std::accumulate` so those **** will find it sooner. – johnchen902 Sep 07 '17 at 05:38
  • My code should be correct because the checksum is the same. And for the ordering you can store the index of the index together with the index (so you will be writing to o[index_of_index], not to *o++). Obviously this is slower than writing to *o++, so it may slow you down more than the speedup you get from reading. – NoSenseEtAl Sep 07 '17 at 07:47
  • *My code should be correct because checksum is the same.* That is the exact point of *Maybe I shouldn't use such a simple `std::accumulate`*. Clearly reordering elements won't change the result of `std::accumulate`. – johnchen902 Sep 07 '17 at 16:21
  • *IDK if you can afford to have index like this* No *if indices are not random you can get buffer overflow* If the indices are truly random there are non-zero chance that it still overflows. The probability seems to be small but one can't be sure without explicit computation. (See: birthday paradox) – johnchen902 Sep 07 '17 at 16:34
  • Probability is practically 0 if you know indexes are uniformly distributed and there is a lot of them. If you are bored you can print out the max sizes of buckets for 1000 different runs and see. – NoSenseEtAl Sep 07 '17 at 18:08
  • I tried playing a bit more with this and failed. If you get a better algorithm please paste the solution as an answer I am really curious now if it can be improved. :) Only improvement you can do now is to fuse the generation of indexes and writing to output, so you avoid one write to index array, but IDK if that is possible in real code. – NoSenseEtAl Sep 08 '17 at 11:59
  • BTW I will delete this answer soon, since it is useless, if you need to c/p code for later or something do it. :) – NoSenseEtAl Sep 08 '17 at 12:05