Why is processing multiple streams of data slower than processing one?

Question

I'm testing how reading multiple streams of data affects a CPUs caching performance. I'm using the following code to benchmark this. The benchmark reads integers stored sequentially in memory and writes partial sums back sequentially. The number of sequential blocks that are read from is varied. Integers from the blocks are read in a round-robin manner.

#include <iostream>
#include <vector>
#include <chrono>
using std::vector;
void test_with_split(int num_arrays) {
    int num_values = 100000000;
    // Fix up the number of values. The effect of this should be insignificant.
    num_values -= (num_values % num_arrays);
    int num_values_per_array = num_values / num_arrays;
    // Initialize data to process
    auto results = vector<int>(num_values);
    auto arrays = vector<vector<int>>(num_arrays);
    for (int i = 0; i < num_arrays; ++i) {
        arrays.emplace_back(num_values_per_array);
    }
    for (int i = 0; i < num_values; ++i) {
        arrays[i%num_arrays].emplace_back(i);
        results.emplace_back(0);
    }
    // Try to clear the cache
    const int size = 20*1024*1024; // Allocate 20M. Set much larger then L2
    char *c = (char *)malloc(size);
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < size; j++)
            c[j] = i*j;
    free(c);
    auto start = std::chrono::high_resolution_clock::now();
    // Do the processing
    int sum = 0;
    for (int i = 0; i < num_values; ++i) {
        sum += arrays[i%num_arrays][i/num_arrays];
        results[i] = sum;
    }
    std::cout << "Time with " << num_arrays << " arrays: " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start).count() << " ms\n";
}
int main() {
    int num_arrays = 1;
    while (true) {
        test_with_split(num_arrays++);
    }
}

Here are the timings for splitting 1-80 ways on an Intel Core 2 Quad CPU Q9550 @ 2.83GHz:

Time taken when splitting to different number of streams

The bump in the speed soon after 8 streams makes sense to me, as the processor has an 8-way associative L1 cache. The 24-way associative L2 cache in turn explains the bump at 24 streams. These especially hold if I'm getting the same effects as in Why is one loop so much slower than two loops?, where multiple big allocations always end up in the same associativity set. To compare I've included at the end timings when the allocation is done in one big block.

However, I don't fully understand the bump from one stream to two streams. My own guess would be that it has something to do with prefetching to L1 cache. Reading the Intel 64 and IA-32 Architectures Optimization Reference Manual it seems that the L2 streaming prefetcher supports tracking up to 32 streams of data, but no such information is given for the L1 streaming prefetcher. Is the L1 prefetcher unable to keep track of multiple streams, or is there something else at play here?

Background

I'm investigating this because I want to understand how organizing entities in a game engine as components in the structure-of-arrays style affects performance. For now it seems that the data required by a transformation being in two components vs. it being in 8-10 components won't matter much with modern CPUs. However, the testing above suggests that sometimes it might make sense to avoid splitting some data into multiple components if that would allow a "bottlenecking" transformation to only use one component, even if this means that some other transformation would have to read data it is not interested in.

Allocating in one block

Here are the timings if instead allocating multiple blocks of data only one is allocated and accessed in a strided manner. This does not change the bump from one stream to two, but I've included it for sake of completeness.

Timings when only one big block is allocated

And here is the modified code for that:

void test_with_split(int num_arrays) {
    int num_values = 100000000;
    num_values -= (num_values % num_arrays);
    int num_values_per_array = num_values / num_arrays;

    // Initialize data to process
    auto results = vector<int>(num_values);
    auto array = vector<int>(num_values);
    for (int i = 0; i < num_values; ++i) {
        array.emplace_back(i);
        results.emplace_back(0);
    }

    // Try to clear the cache
    const int size = 20*1024*1024; // Allocate 20M. Set much larger then L2
    char *c = (char *)malloc(size);
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < size; j++)
            c[j] = i*j;
    free(c);

    auto start = std::chrono::high_resolution_clock::now();
    // Do the processing
    int sum = 0;
    for (int i = 0; i < num_values; ++i) {
        sum += array[(i%num_arrays)*num_values_per_array+i/num_arrays];
        results[i] = sum;
    }
    std::cout << "Time with " << num_arrays << " arrays: " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start).count() << " ms\n";
}

Edit 1

I made sure that the 1 vs 2 splits difference was not due to the compiler unrolling the loop and optimizing the first iteration differently. Using the __attribute__ ((noinline)) I made sure the work function is not inlined into the main function. I verified that it did not happen by looking at the generated assembly. The timings after these changed were the same.

If `num_arrays` isn't a compile-time constant, I suggest that you pull the division and modulus out of the inner-most loop. It'll probably kill your performance along with your benchmark. — Mysticial, Jan 23 '15 at 19:44
Agreed. The combination of Modulus and Division on your target architecture (Core2) can take between 30 and 150 cycles, depending on a number of factors. That said, I expect once you resolve that, you'll find the second version will be *significantly* faster compared to the first (more-so than even now). — jstine, Jan 24 '15 at 19:45

score 2 · Accepted Answer · answered Jan 24 '15 at 20:29

To answer the main part of your question: Is the L1 prefetcher able to keep track of multiple streams?

No. This is actually because the L1 cache doesn't have a prefetcher at all. The L1 cache isn't big enough to risk speculatively fetching data that might not be used. It would cause too many evictions and adversely impact any software that isn't reading data in specific patterns suited to that particular L1 cache prediction scheme. Instead L1 caches data that has been explicitly read or written. L1 caches are only helpful when you are writing data and re-reading data that has recently been accessed.

The L1 cache implementation is not the reason for your profile bump from 1X to 2X array depth. On streaming reads like what you've set up, the L1 cache plays little or no factor in performance. Most of your reads are coming directly from the L2 cache. In your first example using nested vectors, some number of reads are probably pulled from L1 (see below).

My guess is your bump from 1X to 2X has a lot to do with the algo and how the compiler is optimizing it. If the compiler knows num_arrays is a constant equal to 1, then it will automatically eliminate a lot of per-iteration overhead for you.

Now for the second part, as to why is the second version faster?:

The reason for the second version being faster is not so much in how the data is arranged in physical memory, but rather what under-the-hood logic change a nested std::vector<std::vector<int>> type implies.

In the nested (first) case, compiled code performs the following steps:

Accesses top-level std::vector class. This class contains a pointer to the start of the data array.
That pointer value must be loaded from memory.
Add current loop offset [i%num_arrays] to that pointer.
Access nested std::vector class data. (Likely L1 cache hit)
Load pointer to the vector's start of data array. (Likely L1 cache hit)
Add loop offset [i/num_arrays]
Read data. Finally!

(note the chances of getting L1 cache hits on steps #4 and #5 decrease drastically after 24 streams or so, due to likeliness of evictions before the next iteration trough the loop)

The second version, by comparison:

Accesses top-level std::vector class.
Load pointer to the vector's start of data array.
Add offset [(i%num_arrays)*num_values_per_array+i/num_arrays]
Read data!

An entire set of under-the-hood steps are removed. The calculation for offset is slightly longer since it needs an extra multiply by num_values_per_array. But the other steps more than make up for it.

From the x86/x64 optimization manual I linked to it seems the L1 data cache does have some kind of a prefetcher. Page 2-23 talks about a prefetcher for the L1 data cache that is also called a "streaming prefetcher". — Olli Saarikivi, Jan 25 '15 at 16:27
I did some additional testing and verified that the compiler optimizing the first loop iteration was not the case. Thanks for your input! I didn't even realize that could happen. — Olli Saarikivi, Jan 25 '15 at 16:30
Yeah I'm familiar with the specific passage you've referenced in the Optimization manual. It's referring to Sandy Bridge & newer architectures. Your Core2 has only the software-based prefetch instruction for hinting stuff into L1. My original draft answer started to include explanations of the newer architectures but... it started to get really long and wouldn't be useful to explaining your specific observations, I felt. — jstine, Jan 26 '15 at 14:50
Intel Sandybridge-family has its main prefetcher in L2 (the "streamer"), but there *is* a prefetcher in L1d, which IIRC is based on execution of load uops and looks for patterns in loops. It's less aggressive than the L2 prefetchers. But yeah, if you're answer for Core2, then that's different, much older and less advanced HW prefetch. — Peter Cordes, Jul 20 '22 at 03:14

Why is processing multiple streams of data slower than processing one?

Background

Allocating in one block

Edit 1

1 Answers1