
I want to store a 3d-volume in memory. I use a linear array for this purpose and then calculate the 1d-index from the 3d-index. It is wrapped in a class called Volume that provides functions for accessing the data elements of the array. Here is the function for accessing one data element of the volume:

template<typename T>
inline T& Volume<T>::at(size_t x, size_t y, size_t z) {
    if (x >= this->xMax || y >= this->yMax || z >= this->zMax) throw std::out_of_range("Volume index out of bounds");
    return this->volume[x * this->yMax*this->zMax + y*this->zMax + z];
}

Now, this linearises the 3d volume with a Z-fastest index order. If the volume is accessed in a loop like this, the volume elements are iterated over sequentially, in the order they lie in memory:

Volume<float> volume(10, 20, 30); //parameters define size
for(int x = 0; x < volume.xSize(); ++x) {
    for(int y = 0; y < volume.ySize(); ++y) {
        for(int z = 0; z < volume.zSize(); ++z) {
            volume.at(x, y, z);  //do sth with this voxel
        }
    }
}

However, if I wrote the loops like this, the elements would not be accessed sequentially but in a strided, more "random" order:

Volume<float> volume(10, 20, 30); //parameters define size
for(int z = 0; z < volume.zSize(); ++z) {
    for(int y = 0; y < volume.ySize(); ++y) {
        for(int x = 0; x < volume.zSize(); ++x) {
            volume.at(x, y, z);  //do sth with this voxel
        }
    }
}

Now, the first case runs fast, the second case slow. My first question is: why? I guess it has got something to do with caching, but I'm not sure.

Now, I could rewrite the access function for the volume elements like this:

template<typename T>
inline T& Volume<T>::at(size_t x, size_t y, size_t z) {
    if (x >= this->xMax || y >= this->yMax || z >= this->zMax) throw std::out_of_range("Volume index out of bounds");
    return this->volume[z * this->yMax*this->xMax + y*this->xMax + x];
}

Then loop order #2 would be fast (because access happens sequentially), but loop order #1 slow.

Now, for some reason I need both index orders in my program, and both should be fast. The idea is that the index ordering can be specified when the volume is created and is then used for all accesses. First I tried a simple if-else statement in the at function. However, that did not seem to do the trick.
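
For illustration, a minimal sketch of what that if-else attempt might have looked like (my reconstruction, assuming a mode member of type IndexOrder as used further below; not the original code):

template<typename T>
inline T& Volume<T>::at(size_t x, size_t y, size_t z) {
    if (x >= this->xMax || y >= this->yMax || z >= this->zMax) throw std::out_of_range("Volume index out of bounds");
    //branch on the index order on every single access
    if (this->mode == IndexOrder::X_FASTEST)
        return this->volume[z * this->yMax*this->xMax + y*this->xMax + x];
    else
        return this->volume[x * this->yMax*this->zMax + y*this->zMax + z];
}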

So I tried something like this when the ordering mode is set:

template<typename T>
void Volume<T>::setMemoryLayout(IndexOrder indexOrder) {
    this->mode = indexOrder;
    if (indexOrder == IndexOrder::X_FASTEST) {
        this->accessVoxel = [this](size_t x, size_t y, size_t z)->T& {
            return this->volume[z * this->yMax*this->xMax + y*this->xMax + x];
        };
    } else {
        this->accessVoxel = [this](size_t x, size_t y, size_t z)->T& {
            return this->volume[x * this->yMax*this->zMax + y*this->zMax + z];
        };
    }
}

And then when a voxel is actually accessed:

template<typename T>
inline T& Volume<T>::at(size_t x, size_t y, size_t z) {
    if (x >= this->xMax || y >= this->yMax || z >= this->zMax) throw std::out_of_range("Volume index out of bounds");
    return this->accessVoxel(x, y, z);
}
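
For reference, accessVoxel is not declared in the snippets above; a plausible declaration (an assumption of mine, since a capturing lambda has to be stored type-erased) would be a std::function member:

#include <cstddef>
#include <functional>

enum class IndexOrder { X_FASTEST, Z_FASTEST }; //second enumerator name is assumed

template<typename T>
class Volume {
    //...other members and functions from the question...
    IndexOrder mode;
    size_t xMax, yMax, zMax;
    T* volume;                                               //the linear array
    std::function<T&(size_t, size_t, size_t)> accessVoxel;   //set by setMemoryLayout
};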

So my idea was to remove the overhead of the if-statement that would otherwise be needed inside the at function, by defining a lambda function once, when the mode is changed; the lambda then only has to be called whenever at is called. However, this did not achieve what I desired.

My question is why my attempts didn't work, and whether there is a way to actually do what I want: a volume that supports X-fastest as well as Z-fastest index ordering and offers the corresponding performance gain when it is looped over accordingly.

NOTE: My goal is not to be able to switch between the two modes while data is already assigned to the volume and have that data still read correctly.

bweber
  • You called it on caching. As soon as you start hopping around you don't get nice long runs of data that can be read in batches. The CPU is going to read not only the memory you want, but some number of its neighbours, on the assumption that if you want N, then pretty soon you will want N+1. Might as well load N+1 now and save time later. If you want N and then N+100, you're going to read the memory around N and then the memory around N+100, and that gets painful. – user4581301 Mar 17 '16 at 23:06
  • Is your `volume` 3D vector really that small, or is it larger? – Dialecticus Mar 17 '16 at 23:32
  • "this did not achieve what I desired" is not a question. You are talking about fast, but what is fast enough for you? And there are some bugs in your code I guess, e.g. `for int x = 0; x < volume.zSize(); ++x`. And yes, you can specify storage mode like row_major or column_major at compile time, e.g. through specialization. – knivil Mar 17 '16 at 23:34
  • An `if` that tests a compile-time constant will optimize away to literally nothing, like it was never there. Evaluating a lambda expression might not be optimized away when the compiler can't prove what `accessVoxel` is set to. Like @knivil said, making row major vs column major storage order a template parameter should do the trick nicely (see the sketch after these comments). You always want the functions of this class to inline, so there's no bloat from needing to generate more stand-alone versions of functions. – Peter Cordes Mar 18 '16 at 01:58
  • http://stackoverflow.com/questions/12264970/why-is-my-program-slow-when-looping-over-exactly-8192-elements has an answer with some links to other questions, so if you want to start wading through lots of text about sequential vs. strided access, and potential gotchas when the stride is a multiple of 2048B (cache aliasing) / 4096B (false dependency between accesses on Intel), then start looking there. – Peter Cordes Mar 18 '16 at 02:05
  • @knivil: Fast enough is when it is as fast as if one of the two possibilities was hardcoded, as in the first code snippet. The `zSize()` is just a typo, actually it is `xSize()`. @Peter Cordes: Thanks. I draw the conclusion that this performance loss happens because it's unknown at compile time which index ordering will be used, so the compiler cannot optimise as well. @knivil: Sometimes it's 35 MB, sometimes 24 GB in memory. Depends on the size of the input data. – bweber Mar 18 '16 at 19:24
  • If the code works with dozens of gigabytes then the problem is not the CPU cache but virtual memory page swapping. Look at your hard disk during execution. If it is working like hell, then it's VM. That means my answer is actually the one you should accept as the correct one. – Dialecticus Mar 20 '16 at 10:16
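
A minimal sketch of the compile-time alternative that knivil and Peter Cordes suggest above, making the index order a template parameter so the branch disappears at compile time (names and details are illustrative, not from the thread):

#include <cstddef>
#include <stdexcept>
#include <vector>

enum class IndexOrder { X_FASTEST, Z_FASTEST };

template<typename T, IndexOrder Order>
class Volume {
public:
    Volume(size_t xMax, size_t yMax, size_t zMax)
        : xMax(xMax), yMax(yMax), zMax(zMax), volume(xMax * yMax * zMax) {}

    T& at(size_t x, size_t y, size_t z) {
        if (x >= xMax || y >= yMax || z >= zMax) throw std::out_of_range("Volume index out of bounds");
        //Order is a compile-time constant, so the compiler keeps only one of these branches
        if (Order == IndexOrder::X_FASTEST)
            return volume[z * yMax*xMax + y*xMax + x];
        else
            return volume[x * yMax*zMax + y*zMax + z];
    }

private:
    size_t xMax, yMax, zMax;
    std::vector<T> volume;
};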

2 Answers


On my CPU (and probably yours) cache lines are 64 bytes. Each cache line holds sixteen 4-byte floats. When a cache line is fetched for the first float, you don't need to repeat that work for the following 15 when accessing sequentially.

Note that it takes about 240 cycles to fetch a cache line from main memory. Fetching from L1 cache is something like 12 cycles; this is a big difference if you can hit L1 repeatedly. (L2 costs about 40 cycles, L3 about 150 cycles.)

The second caching win from sequential access is that the CPU will prefetch data into cache for you when reading sequentially. So if you start at the beginning of an array and move through it sequentially you can even avoid the penalty for reading the cache line in.

L1 is usually 32 KB of data (plus 32 KB of instruction cache); on this machine L2 is 256 KB, and L3 is measured in megabytes. So the smaller you can keep your memory working set, the more of it you can fit in a given cache. Fitting it all in L1 is optimal.

The third reason sequential access is optimal is that it gives your compiler the opportunity to vectorise the instructions, i.e. to use SSE or AVX instructions. AVX registers are 32 bytes, so one can hold 8 floats. Potentially you can operate on 8 consecutive items of the array at once, speeding things up by a factor of 8.
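
As a rough illustration (my example, not part of the original answer), this is the kind of loop an auto-vectoriser can handle, because the innermost loop walks the contiguous (z-fastest) dimension with stride 1; data is assumed to be the underlying linear array from the question:

#include <cstddef>

//multiply every voxel by a factor; the inner loop is contiguous, so the
//compiler can process several floats per SSE/AVX instruction
void scale(float* data, size_t xMax, size_t yMax, size_t zMax, float factor) {
    for (size_t x = 0; x < xMax; ++x)
        for (size_t y = 0; y < yMax; ++y)
            for (size_t z = 0; z < zMax; ++z)
                data[(x * yMax + y) * zMax + z] *= factor;
}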

Hal
  • L1 load-use latency in typical x86 CPUs is 4 or 5 cycles. Intel CPUs with per-core L2 caches have an L2 latency of something like 12c. See http://agner.org/optimize/. The large shared L3 is much slower, esp. on a big Xeon where it can be farther away around the ring bus, or if other cores are keeping the L3 busy as well. Still upvoted for mostly-correct ideas; the specific numbers aren't that important. Do note that modern hardware prefetchers can recognize strided access patterns, so you don't have to pay the full latency penalty for every access. Many can be in flight. – Peter Cordes Mar 18 '16 at 01:39
  • The point about SIMD only working when the inner loop is accessing contiguous memory is an important one, though, beside making full use of each cache-line fetch. – Peter Cordes Mar 18 '16 at 01:42
  • @PeterCordes Last cpu I measured went 12,40,150,240 in cycle count from L1 through to main memory (some core i9 xeon). Ulrich Drepper measured a Pentium M and got 3,14,240 (No L3?) Pointing to the top level of Agner's site didn't help me find his numbers, I wonder if you can link them better than that? I don't know which cpu has the numbers you're quoting. So yes, absolutely, the numbers are CPU dependent. Drepper in case someone reading this hasn't seen it and cares - it's seminal: https://www.akkadia.org/drepper/cpumemory.pdf btw "mostly-correct ideas[sic]" will be perceived as insulting. – Hal Mar 18 '16 at 04:59
  • Agner's microarch pdf has the cache latency numbers. I tend to just link the top level because I can type that url accurately, and I don't want people to miss the other stuff if I just link one. Sorry for imprecise linking. Table 10.11 on pg143 for Haswell. His measurements are: L1: 4c / L2: 12c / L3: 34c. That's on a desktop, not a Xeon, and is probably a best-case figure. L1/L2 will be the same for Xeon vs. laptop because they're per-core. re: P-M: Intel didn't switch to the big shared-L3 until Nehalem, so yes, Pentium M only has L1 / L2. – Peter Cordes Mar 18 '16 at 05:12
  • Good call on linking Drepper's paper, it's great (but the software-prefetch recommendations are getting a bit out of date, because they were written for P4; later CPUs have smarter HW prefetchers, so as I understand it, software prefetch is mostly useful for less-predictable patterns, e.g. a binary-search can prefetch both possibilities for the iteration after this (+1/4 or +3/4)). Anyway, I added it to the [x86 tag wiki](http://stackoverflow.com/tags/x86/info) a couple of months ago. – Peter Cordes Mar 18 '16 at 05:15
  • I said "mostly correct" because the addresses of the loads aren't dependent on the result of the preceding load, so multiple outstanding cache misses will be in flight at once. This means they don't have to pay the full load-use latency of main memory. Also, modern hardware prefetchers supposedly recognize even strided access patterns up to a limited stride size, so HW prefetch would be attempting to fetch future memory if the fill buffers weren't already full from actual reads. I haven't done a lot of memory access tuning, mostly uop counting in CPU-bound stuff. – Peter Cordes Mar 18 '16 at 05:24
  • And I was trying to keep it short instead of writing my own answer, but thanks to your prodding I seem to have expanded on yours with about an answer's worth of typing. :P BTW, which CPU did you measure that has 12c L1 cache? Were you just measuring pointer-chasing with a pointer that points to itself, or in a tiny cycle? Because that's what I did to measure for myself on my SnB CPU. (Using perf counters to count cycles, because `rdtsc` counts reference clocks, not the actual current clock.) – Peter Cordes Mar 18 '16 at 05:29
  • @PeterCordes You said "mostly-correct [sic]" because you were being condescending and trying to big-note your expertise. Sequential access will get pre-fetched; whether other memory access patterns can be pre-fetched as well doesn't change that fact. Just cool it a bit, huh? I'm sure you know things and you're insanely brilliant etc etc. You don't actually have to impersonate someone being rude to establish this. People will draw a conclusion you may not care for otherwise. – Hal Mar 18 '16 at 07:31
  • Honestly, I typed "mostly-correct" because I was in a bit of a hurry and didn't want to take the time to explain everything. Making myself sound like a badass was not the main point, but I admit I did realize it would come across that way. When I re-read it now, it's more condescending than I intended. Sorry. Re: prefetch: The way I read your 3rd paragraph, it implies that prefetching won't do this for non-sequential accesses. (Otherwise it wouldn't be another separate benefit to sequential.) It does often make it possible for the prefetcher to keep up, where it couldn't otherwise. – Peter Cordes Mar 18 '16 at 07:41
  • Thank you. Although I don't know much about the hardware details, this is kind of what I suspected. I think this is the answer to why it is slower if the array is not looped over sequentially. – bweber Mar 18 '16 at 19:30
  • It doesn't give me an answer to the second part of my question though, but I think there is no simple one. As mentioned above, the reason for the problem probably is that the index order is not known at compile time, and thus the compiler can't optimise as well. I circumvented this drawback now by offering a function that returns a pointer to the actual array (or a slice or a row of it), so one can iterate over the actual C-array manually in the way one wants, similar to how the OpenCV people do it with their `cv::Mat` and `cv::Mat::ptr<>`. – bweber Mar 18 '16 at 19:33

The physical memory of your computer is not large enough to contain the whole array. One solution to your problem would be to add more memory.

Your OS works with virtual memory. Pages of virtual memory are moved to disk whenever more memory is needed. Accessing the disk is time consuming, and that's what kills the performance. In your worst-case scenario the OS keeps writing and reading (or just reading) pages all the time.

So, another solution would be to reorganize your data in such a way that disk access is about the same regardless of the direction in which the pixels are scanned. I propose 3D blocks the size of a page (usually 4 KB, so a cube with a side of 16 pixels). When you scan in one direction you would touch only a few of these pages, and when you scan in another direction you would touch the same number of pages, just a different set of them. With a little luck (it depends on available physical memory) no pages will needlessly move in and out of the swap file.
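
A sketch of the page-sized-block idea (my illustration; the helper and its parameters are made up, and it assumes the dimensions are padded up to a multiple of the brick size):

#include <cstddef>

//group voxels into 16x16x16 bricks stored contiguously, so a scan along any
//axis only touches a handful of bricks (and therefore pages) at a time;
//16^3 one-byte voxels correspond to one 4 KB page
static const size_t B = 16; //brick side length

inline size_t brickedIndex(size_t x, size_t y, size_t z,
                           size_t yBricks, size_t zBricks) {
    const size_t bx = x / B, by = y / B, bz = z / B; //which brick
    const size_t ix = x % B, iy = y % B, iz = z % B; //position inside the brick
    const size_t brick = (bx * yBricks + by) * zBricks + bz;
    const size_t inner = (ix * B + iy) * B + iz;
    return brick * (B * B * B) + inner;
}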

The best and simplest solution is to scan the pixels in only one direction. Maybe you don't really have to have the ability to scan the pixels sideways.

Dialecticus
  • The OP gave examples with tiny sizes. This is a caching issue, not a VM paging issue. The problem is essentially the same, though: only making use of a small amount of the data in each cache-line or page as it's brought in from main mem or disk. – Peter Cordes Mar 18 '16 at 02:07