data = [1, 2, 3, 4, 5, 7]
index = [5, 1, 4, 0, 2, 3]
We want to create a new array from elements of data at position from
index. Result should be
result -> [4, 2, 5, 7, 3, 1]
Single thread, one pass
I think, for a few million elements and on a single thread, the naive approach might be the best here.
Both data
and index
are accessed (read) sequentially, which is already optimal for the CPU cache. That leaves the random writing, but writing to memory isn't as cache friendly as reading from it anyway.
This would only need one sequential pass through data and index. And chances are some (sometimes many) of the writes will already be cache-friendly too.
Using multiple blocks for result
- multiple threads
We could allocate or use cache-friendly sized blocks for the result (blocks being regions in the result array
), and loop through index
and data
multiple times (while they stay in the cache).
In each loop we then only write elements to result
that fit in the current result-block. This would be 'cache friendly' for the writes too, but needs multiple loops (the number of loops could even get rather high - i.e. size of data / size of result-block
).
The above might be an option when using multiple threads: data
and index
, being read-only, would be shared by all cores at some level in the cache (depending on the cache architecture). The result
blocks in each thread would be totally independent (one core never has to wait for the result of another core, or a write in the same region). For example: 10 million elements - each thread could be working on an independent result block of say 500.000 elements (number should be a power of 2).
Combining the values as a pair and sorting them first: this would already take much more time than the naive option (and wouldn't be that cache friendly either).
Also, if there are only a few million of elements (integers), it won't make much of a difference. If we would be talking about billions, or data that doesn't fit in memory, other strategies might be preferable (like for example memory mapping the result set if it doesn't fit in memory).