Cache-friendly copying of an array with readjustment by known index, gather, scatter

Question

Suppose we have an array of data and another array with indexes.

data = [1, 2, 3, 4, 5, 7]
index = [5, 1, 4, 0, 2, 3]

We want to create a new array by placing each element of data at the position given by the corresponding entry of index (that is, result[index[i]] = data[i]). The result should be

[4, 2, 5, 7, 3, 1]

The naive algorithm runs in O(N) time, but it performs random memory access.
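For reference, the naive approach can be sketched as a single loop. This is only a minimal sketch: it assumes every entry of index is a valid position in the result, and the function name `scatter_naive` is hypothetical, not something from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Naive scatter: result[index[i]] = data[i].
   data[] and index[] are read sequentially; result[] is written in
   whatever order index[] dictates.
   Assumption: every index[i] is in [0, n), checked with assert. */
void scatter_naive(const int *data, const unsigned *index, int *result, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        assert(index[i] < n);
        result[index[i]] = data[i];
    }
}
```

Called with the data and index arrays from the example above and n = 6, this fills result with 4, 2, 5, 7, 3, 1.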

Can you suggest a CPU-cache-friendly algorithm with the same complexity?

PS In my particular case all elements in the data array are integers.

PPS Arrays might contain millions of elements.

PPPS I'm ok with SSE/AVX or any other x64 specific optimizations

How did you arrive at your result? Shouldn't it be `[7, 2, 5, 1, 3, 4]`? — Sergey Kalinichenko, Jan 09 '16 at 13:22
@dasblinkenlight: It's the other way round, the index is the index in the resulting array: `result[index[i]] = data[i]` — M Oehm, Jan 09 '16 at 13:24
The naive algorithm is already cache-efficient for reads of both `data[]` and `index[]`, because it reads them sequentially. It's random access for writes, but writes are not cache-friendly no matter how you look at them. — Sergey Kalinichenko, Jan 09 '16 at 13:34
@dasblinkenlight it's not cache friendly for reads. If you index = [0, length -1, etc]. My naive algorithm will read first element and then last element from the data array. So it's a random memory access. If you are talking about another implementation please provide the code. — sh1ng, Jan 09 '16 at 13:48
@FrerichRaabe you read index array sequentially, but get an element from data array randomly. — sh1ng, Jan 09 '16 at 14:03
@sh1ng aren't you doing it like this: `for (int i = 0 ; i != 10000 ; i++) result[index[i]] = data[i];`? — Sergey Kalinichenko, Jan 09 '16 at 14:09
@sh1ng - Are you sure about that? Wouldn't that be the case if result needed to be `[7, 2, 5, 1, 3, 4]`? Would be a difference in algorithm. — Danny_ds, Jan 09 '16 at 14:09
@dasblinkenlight you're right, my fault. Only the write part is an issue. — sh1ng, Jan 09 '16 at 14:18
If writing is really the bottleneck, you would have to use a block based algorithm where you read and write from the same block in the array. — CleoR, Jan 11 '16 at 15:16
Should the result be [4,0,...] or [7,2,...] not [4,2,...]? The example provided confuses not elaborates. — Stephen Quan, Jan 11 '16 at 19:34
@StephenQuan nope, 4 must be on position 0, 2 must be on position 1 etc. `data` array doesn't even contain an element 0. — sh1ng, Jan 11 '16 at 19:37
The question is - do you want to be cache-friendly for bandwidth sensitivity (energy/power considerations), or for performance (latency reduction). In the latter case, it's enough if you're prefetch-friendly, so a solution running ahead with builtin_prefetches dereferencing the indexes at some distance would be perfect. — Leeor, Jan 11 '16 at 21:23
It took a lot of reads to comprehend what is being asked. Perhaps including some pseudo code might have helped, but I finally understand the question now. — Stephen Quan, Jan 12 '16 at 03:02
How is the array of indices populated? How do you know you won't see the same index more than once? How do you know all of the indices are within the bounds of the output array? If you have confidence about these latter, that suggests some control over the source. If you have control over the source, perhaps it can be pre-sorted in a way that would avoid thrashing. — synthetel, Jan 15 '16 at 03:35

Evgeny Kluev · Accepted Answer · 2016-01-15T21:14:42.970

Combine index and data into a single array. Then use some cache-friendly sorting algorithm to sort these pairs (by index). Then get rid of indexes. (You could combine merging/removing indexes with the first/last pass of the sorting algorithm to optimize this a little bit).

For cache-friendly O(N) sorting, use radix sort with a small enough radix (at most half the number of cache lines in the CPU cache).

Here is a C implementation of a radix-sort-like algorithm:

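/* Two-pass radix-style reordering. g_index, g_input, g_sort, g_output,
   g_counters and kRadix are globals assumed to be defined elsewhere
   (for example in the benchmark code). */
void reorder2(const unsigned size)
{
    /* Pass 1 bucket boundaries. Bucket r receives every element with
       index % kRadix == r; the first size % kRadix buckets hold one
       extra element, so no counting pass is needed. */
    const unsigned min_bucket = size / kRadix;
    const unsigned large_buckets = size % kRadix;
    g_counters[0] = 0;

    for (unsigned i = 1; i <= large_buckets; ++i)
        g_counters[i] = g_counters[i - 1] + min_bucket + 1;

    for (unsigned i = large_buckets + 1; i < kRadix; ++i)
        g_counters[i] = g_counters[i - 1] + min_bucket;

    /* Pass 1: distribute (index, value) pairs by the low digit and keep
       only the high digit (index / kRadix) for the second pass. */
    for (unsigned i = 0; i < size; ++i)
    {
        const unsigned dst = g_counters[g_index[i] % kRadix]++;
        g_sort[dst].index = g_index[i] / kRadix;
        g_sort[dst].value = g_input[i];
        __builtin_prefetch(&g_sort[dst + 1].value, 1);
    }

    /* Pass 2 bucket boundaries: group q (index / kRadix == q) starts at
       output position q * kRadix. */
    g_counters[0] = 0;

    for (unsigned i = 1; i < (size + kRadix - 1) / kRadix; ++i)
        g_counters[i] = g_counters[i - 1] + kRadix;

    /* Pass 2: distribute by the high digit, dropping the indexes. */
    for (unsigned i = 0; i < size; ++i)
    {
        const unsigned dst = g_counters[g_sort[i].index]++;
        g_output[dst] = g_sort[i].value;
        __builtin_prefetch(&g_output[dst + 1], 1);
    }
}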

It differs from radix sort in two aspects: (1) it does not do counting passes because all counters are known in advance; (2) it avoids using power-of-2 values for radix.
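To experiment with this idea outside the benchmark harness, the same two passes can be written with explicit parameters instead of globals. The following is a minimal sketch, not the benchmarked code: it assumes index is a permutation of 0..n-1, the names `reorder_two_pass` and `pair_t` and the `radix` parameter are hypothetical, and the software prefetch is left out for portability.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { unsigned index; int value; } pair_t;  /* hypothetical pair type */

/* Two-pass radix-style scatter, assuming index[] is a permutation of 0..n-1.
   Returns 0 on success, -1 if scratch memory could not be allocated. */
int reorder_two_pass(const int *input, const unsigned *index, int *output,
                     unsigned n, unsigned radix)
{
    assert(radix > 0);
    const unsigned groups = n / radix + (n % radix != 0); /* values of index / radix */
    const unsigned ncount = radix > groups ? radix : groups;
    pair_t *sorted = malloc((n ? n : 1) * sizeof *sorted);
    unsigned *counters = malloc(ncount * sizeof *counters);
    if (!sorted || !counters) { free(sorted); free(counters); return -1; }

    /* Pass 1: bucket by index % radix. Bucket sizes are known in advance:
       the first n % radix buckets hold one extra element. */
    const unsigned min_bucket = n / radix, large = n % radix;
    counters[0] = 0;
    for (unsigned r = 1; r < radix; ++r)
        counters[r] = counters[r - 1] + min_bucket + (r - 1 < large ? 1 : 0);
    for (unsigned i = 0; i < n; ++i) {
        assert(index[i] < n);
        const unsigned dst = counters[index[i] % radix]++;
        sorted[dst].index = index[i] / radix;
        sorted[dst].value = input[i];
    }

    /* Pass 2: bucket by index / radix; group q starts at output[q * radix]. */
    for (unsigned q = 0; q < groups; ++q)
        counters[q] = q * radix;
    for (unsigned i = 0; i < n; ++i)
        output[counters[sorted[i].index]++] = sorted[i].value;

    free(sorted);
    free(counters);
    return 0;
}
```

Allocating the scratch buffers on every call keeps the sketch self-contained; a benchmark would more likely allocate them once, as the global arrays in the original code suggest.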

This C++ code was used for benchmarking (if you want to run it on a 32-bit system, slightly decrease the kMaxSize constant).

Here are the benchmark results (on a Haswell CPU with a 6 MB cache):

It is easy to see that small arrays (below ~2 000 000 elements) are cache-friendly even for the naive algorithm. You may also notice that the sorting approach starts to become cache-unfriendly at the last point on the diagram (with size/radix near 0.75 of the number of cache lines in the L3 cache). Between these limits the sorting approach is more efficient than the naive algorithm.

In theory (comparing only the memory bandwidth these algorithms need, with 64-byte cache lines and 4-byte values) the sorting algorithm should be 3 times faster. In practice the difference is much smaller, about 20%. It could be improved by using smaller 16-bit values for the data array (in this case the sorting algorithm is about 1.5 times faster).

One more problem with the sorting approach is its worst-case behavior when size/radix is close to some power of 2. This may be either ignored (because there are not many "bad" sizes) or fixed by making the algorithm slightly more complicated.

If we increase the number of passes to 3, all 3 passes use mostly L1 cache, but the required memory bandwidth increases by 60%. I used this code to get experimental results: TL; DR. After determining (experimentally) the best radix value, I got somewhat better results for sizes greater than 4 000 000 (where the 2-pass algorithm uses L3 cache for one pass) but somewhat worse results for smaller arrays (where the 2-pass algorithm uses L2 cache for both passes). As might be expected, performance is better for 16-bit data.

Conclusion: the performance difference is much smaller than the difference in complexity of the algorithms, so the naive approach is almost always better; if performance is very important and only 2- or 4-byte values are used, the sorting approach is preferable.

I'm going to try, but sorting is a more generic task (with O(N log N) complexity). In my case I already know the position of each element and no comparison is required. — sh1ng, Jan 09 '16 at 13:43
@EvgenyKluev Doesn't radix sort suffer from almost exactly the same problem as this one? — 2501, Jan 09 '16 at 13:47
@2501: there is no such problem with radix sort if radix is small enough. The only problem here is non-constant stride memory access (which can be easily solved by software prefetch). — Evgeny Kluev, Jan 09 '16 at 13:53
@EvgenyKluev If I remember correctly LSD radix has a cache problem when writing numbers to the new index, which can be random. This is independent of the bucket size. So with a huge array of really random numbers, you will cache miss all the time. — 2501, Jan 09 '16 at 14:00
@2501: a few days ago I implemented counting sort algorithm (which is almost equal to one of 2..3 passes of radix sort) [here](http://stackoverflow.com/a/34598249/1009831). And there were no noticeable problems with cache. — Evgeny Kluev, Jan 09 '16 at 14:05
Counting sort is cache friendly yes. Radix sort is not counting sort. — 2501, Jan 09 '16 at 14:07

Danny_ds · Answer 2 · 2016-01-11T16:13:15.133

data = [1, 2, 3, 4, 5, 7]

index = [5, 1, 4, 0, 2, 3]

We want to create a new array from elements of data at position from index. Result should be

result -> [4, 2, 5, 7, 3, 1]

Single thread, one pass

I think, for a few million elements and on a single thread, the naive approach might be the best here.

Both data and index are accessed (read) sequentially, which is already optimal for the CPU cache. That leaves the random writing, but writing to memory isn't as cache friendly as reading from it anyway.

This would only need one sequential pass through data and index. And chances are some (sometimes many) of the writes will already be cache-friendly too.

Using multiple blocks for `result` - multiple threads

We could allocate or use cache-friendly sized blocks for the result (blocks being regions in the result array), and loop through index and data multiple times (while they stay in the cache).

In each loop we then only write elements to result that fit in the current result-block. This would be 'cache friendly' for the writes too, but needs multiple loops (the number of loops could even get rather high - i.e. size of data / size of result-block).

The above might be an option when using multiple threads: data and index, being read-only, would be shared by all cores at some level in the cache (depending on the cache architecture). The result blocks in each thread would be totally independent (one core never has to wait for the result of another core, or a write in the same region). For example: 10 million elements - each thread could be working on an independent result block of say 500.000 elements (number should be a power of 2).
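The multi-block idea above can be sketched in single-threaded form as follows. This is only an illustration under stated assumptions: the function name `scatter_blocked` and the `block_size` parameter are hypothetical, and the threading itself is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Blocked scatter: for each block of result[], rescan data[] and index[]
   and write only the elements whose destination falls inside that block.
   Costs ceil(n / block_size) passes over the inputs. */
void scatter_blocked(const int *data, const unsigned *index, int *result,
                     size_t n, size_t block_size)
{
    assert(block_size > 0);
    size_t lo = 0;
    while (lo < n) {
        /* This iteration writes only result[lo .. hi-1], so iterations are
           independent and could be handed to separate threads. */
        const size_t hi = (n - lo < block_size) ? n : lo + block_size;
        for (size_t i = 0; i < n; ++i) {
            const size_t dst = index[i];
            if (dst >= lo && dst < hi)
                result[dst] = data[i];
        }
        lo = hi;
    }
}
```

The extra passes over data and index are the price of confining the writes to one block at a time, which is why this variant is mainly interesting when the blocks are processed in parallel.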

Combining the values as a pair and sorting them first: this would already take much more time than the naive option (and wouldn't be that cache friendly either).

Also, if there are only a few million elements (integers), it won't make much of a difference. If we were talking about billions, or data that doesn't fit in memory, other strategies might be preferable (for example, memory-mapping the result set if it doesn't fit in memory).

This answer, for pointing out that all the sort-based solutions are ridiculous. — ams, Jan 11 '16 at 14:43

score 0 · Answer 3 · answered Jan 09 '16 at 14:57

If your problem deals with a lot more data than you show here the fastest way - and probably the most cache friendly - would be to do a large and wide merge sort operation.

So you would divide the input data into reasonable chunks and have a separate thread operate on each chunk. The result of this operation would be two arrays much like the input (one of data and one of destination indexes), but with the indexes sorted. Then a final thread would merge the data into the final output array.

As long as the segments are chosen well, this should be quite a cache-friendly algorithm. By "well" I mean that the data used by different threads maps onto different cache lines (of your chosen processor), so as to avoid cache thrashing.
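A serial sketch of this chunk-then-merge scheme might look like the code below. Several things here are assumptions rather than part of the answer: data and indexes are kept together as pairs instead of two parallel arrays, each chunk is sorted with the standard `qsort`, the merge is a simple linear scan over the chunk heads, and the names `scatter_chunk_merge` and `item_t` are hypothetical. A threaded version would run the per-chunk sorts concurrently.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { unsigned index; int value; } item_t;  /* hypothetical pair type */

static int by_index(const void *a, const void *b)
{
    const unsigned x = ((const item_t *)a)->index;
    const unsigned y = ((const item_t *)b)->index;
    return (x > y) - (x < y);
}

/* Sorts fixed-size chunks by destination index, then merges them.
   Returns 0 on success, -1 if scratch memory could not be allocated. */
int scatter_chunk_merge(const int *data, const unsigned *index, int *result,
                        size_t n, size_t chunk)
{
    assert(chunk > 0);
    const size_t nchunks = n / chunk + (n % chunk != 0);
    item_t *items = malloc((n ? n : 1) * sizeof *items);
    size_t *head = malloc((nchunks ? nchunks : 1) * sizeof *head);
    if (!items || !head) { free(items); free(head); return -1; }

    for (size_t i = 0; i < n; ++i) {
        assert(index[i] < n);
        items[i].index = index[i];
        items[i].value = data[i];
    }

    /* Phase 1: sort each chunk on its own (one thread per chunk in a
       parallel version; done serially here). */
    for (size_t c = 0; c < nchunks; ++c) {
        const size_t start = c * chunk;
        const size_t len = (n - start < chunk) ? n - start : chunk;
        qsort(items + start, len, sizeof *items, by_index);
        head[c] = start;
    }

    /* Phase 2: k-way merge. Each step takes the smallest remaining
       destination, so result[] is written in ascending order. */
    for (size_t done = 0; done < n; ++done) {
        size_t best = nchunks;  /* sentinel: no chunk chosen yet */
        for (size_t c = 0; c < nchunks; ++c) {
            const size_t end = (n - c * chunk < chunk) ? n : (c + 1) * chunk;
            if (head[c] < end &&
                (best == nchunks || items[head[c]].index < items[head[best]].index))
                best = c;
        }
        result[items[head[best]].index] = items[head[best]].value;
        ++head[best];
    }

    free(items);
    free(head);
    return 0;
}
```

The linear scan in the merge costs O(n * number of chunks); with many chunks a heap over the chunk heads would be the usual replacement.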

CleoR · Answer 4 · 2016-01-11T19:18:23.673

If you have a lot of data and that is indeed the bottleneck, you will need a block-based algorithm that reads and writes from the same blocks as much as possible. It will take up to 2 passes over the data to ensure the new array is entirely populated, and the block size will need to be set appropriately. The pseudocode is below.

def populate(index,data,newArray,cache)
    blockSize = 1000
    for i = 0; i < size(index); i++
        //We cached a value for this position earlier, write it out
        if i in cache
            newArray[i] = cache[i]
            remove(cache,i)
        //Always process element i itself, even if position i was just filled from the cache
        newIndex = index[i]
        newValue = data[i]
        //Check if the destination is in our current block (compare block numbers, integer division)
        if i/blockSize != newIndex/blockSize
            //The destination is not in our current block, cache it
            cache[newIndex] = newValue
        else
            //The destination is in our current block
            newArray[newIndex] = newValue

cache = {}
newArray = []
populate(index,data,newArray,cache)
populate(index,data,newArray,cache)

Analysis

The naive solution accesses the index and data array in order but the new array is accessed in random order. Since the new array is randomly accessed you essentially end up with O(N^2) where N is the number of blocks in the array.

The block based solution does not jump from block to block. It reads the index, data, and new array all in sequence to read and write to the same blocks. If an index will be in another block, it is cached and either retrieved when the block it belongs in comes up or if the block is already passed, it will be retrieved in the second pass. A second pass will not hurt at all. This is O(N).

The only caveat is in dealing with the cache. There are a lot of opportunities to get creative here but in general if a lot of the reads and writes end up being on different blocks, the cache will grow and this is not optimal. It depends on the makeup of your data, how often this occurs and your cache implementation.

Let's imagine that all of the information inside the cache exists on one block and it fits in memory. And let's say the cache has y elements. The naive approach would have randomly accessed at least y times. The block-based approach will get those in the second pass.

A block based algorithm is exactly what I suggested in my second option (with the possibility of using multiple threads). I think using an extra cache will only introduce extra overhead: 1) `cache[newIndex] = newValue`, isn't that the same as writing directly to the whole result[] in the first place? 2) What would the cost be of `remove(cache,i)` assuming cache is an array? 3) Not really multi-thread friendly I think. 4) cache[] only introduces extra memory usage, which would _steal_ away some cache. — Danny_ds, Jan 11 '16 at 17:58
Naive solution is O(N) time, we're talking about block accesses. — CleoR, Jan 11 '16 at 18:02
You need to write to the block you're reading from. This isn't going to be the case if you random access the new array. The algorithm I proposed only writes to the new array if it is in the current block. — CleoR, Jan 11 '16 at 18:07
The cache isn't an array btw, it's a key value store and so doing cache[newIndex] = newValue is not necessarily randomly accessing the data blob. — CleoR, Jan 11 '16 at 19:05

score 0 · Answer 5 · answered Jan 12 '16 at 03:13

I notice your index completely covers the domain but is in random order.

If you were to sort the index array and apply every operation performed on it to the data array as well, the data array would become the result you are after.

There are plenty of sort algorithms to select from, and all would satisfy your cache-friendly criterion, but their complexity varies. I'd consider either quicksort or mergesort.
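As an illustration of sorting the index while mirroring every swap in the data, here is a minimal quicksort sketch. The function names `cosort` and `swap_both`, the Lomuto partition, and the middle-element pivot are choices made for this sketch, not part of the answer. It works in place, so both input arrays are modified.

```c
#include <stddef.h>

/* Swap positions a and b in both arrays so they stay paired. */
static void swap_both(unsigned *index, int *data, size_t a, size_t b)
{
    const unsigned ti = index[a]; index[a] = index[b]; index[b] = ti;
    const int td = data[a]; data[a] = data[b]; data[b] = td;
}

/* Quicksort on index[lo..hi) that mirrors every swap in data[].
   Once index[] is sorted, data[] holds the reordered result. */
void cosort(unsigned *index, int *data, size_t lo, size_t hi)
{
    while (hi - lo > 1) {
        /* Lomuto partition; the middle element is moved to the end
           and used as the pivot. */
        swap_both(index, data, lo + (hi - lo) / 2, hi - 1);
        const unsigned pivot = index[hi - 1];
        size_t store = lo;
        for (size_t i = lo; i + 1 < hi; ++i)
            if (index[i] < pivot)
                swap_both(index, data, i, store++);
        swap_both(index, data, store, hi - 1);

        /* Recurse into the smaller side and loop on the larger one
           to keep the recursion depth logarithmic. */
        if (store - lo < hi - store - 1) {
            cosort(index, data, lo, store);
            lo = store + 1;
        } else {
            cosort(index, data, store + 1, hi);
            hi = store;
        }
    }
}
```

Calling `cosort(index, data, 0, n)` on the example arrays leaves index as 0..5 and data as 4, 2, 5, 7, 3, 1.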

If you're interested in this answer I can elaborate with pseudo code.

score 0 · Answer 6 · answered Jan 15 '16 at 08:00

I am concerned this may not be a winning pattern.

We had a piece of code which performed well, and we optimized it by removing a copy.

The result was that it performed poorly (due to cache issues). I can't see how you can produce a single-pass algorithm that solves the issue. Using OpenMP may allow the stalls this causes to be shared among multiple threads.

score 0 · Answer 7 · answered Jan 17 '16 at 20:24

I assume that the reordering happens only once in the same way. If it happens multiple times, then creating a better strategy beforehand (with an appropriate sorting algorithm) will improve performance.

I wrote the following program to test whether a simple split of the target into N blocks helps, and my findings were:

a) Even in the worst cases, single-thread performance with segmented writes does not exceed that of the naive strategy, and it is usually worse by at least a factor of 2.

b) However, the performance approaches unity for some subdivisions (probably depends on the processor) and array sizes, thus indicating that it actually would improve the multi-core performance

The consequence of this is: yes, it's more "cache-friendly" than not subdividing, but for a single thread (and only one reordering) this won't help you a bit.

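#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

/* Elapsed microseconds between two gettimeofday() samples. */
static double elapsed_us(struct timeval a, struct timeval b)
{
    return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_usec - a.tv_usec);
}

int main(void)
{
    const int N = 1 << 26;

    double *source = malloc(N * sizeof(double));
    double *target = malloc(N * sizeof(double));
    int *idx = malloc(N * sizeof(int));
    if (!source || !target || !idx)
        return 1;

    int i;
    for (i = 0; i < N; i++) {
        source[i] = i;
        target[i] = 0;
        idx[i] = rand() % N;   /* random destinations, not necessarily a permutation */
    }

    /* Naive scatter: sequential reads, random writes. */
    struct timeval now, then;
    gettimeofday(&now, NULL);
    for (i = 0; i < N; i++) {
        target[idx[i]] = source[i];
    }
    gettimeofday(&then, NULL);
    printf("%f\n", elapsed_us(now, then) / N);

    /* Segmented variant: the target is split into blocks of 2^M elements. */
    gettimeofday(&now, NULL);
    int j;
    int targetblocks;
    int M = 24;
    int targetblocksize = 1 << M;
    targetblocks = (N / targetblocksize);
    /* Note: j is the inner loop, so writes still happen in idx order;
       making j the outer loop would group the writes block by block. */
    for (i = 0; i < N; i++) {
        for (j = 0; j < targetblocks; j++) {
            int k = idx[i];
            if ((k >> M) == j) {
                target[k] = source[i];
            }
        }
    }
    gettimeofday(&then, NULL);
    printf("%d,%f\n", targetblocks, elapsed_us(now, then) / N);

    free(source);
    free(target);
    free(idx);
    return 0;
}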
