
I have been using https://github.com/google/benchmark and g++ 9.4.0 to check the performance of data access in different scenarios (compilation with "-O3"). The result has been surprising to me.

My baseline is accessing longs in an std::array ("reduced data"). Then I want to add an additional byte of data per element. In one variant I create an additional container for the bytes ("split data"), and in the other I store a struct with both values in a single array ("combined data").

This is the code:

#include <benchmark/benchmark.h>

#include <array>
#include <cstdint>
#include <random>

constexpr int width  = 640;
constexpr int height = 480;

std::array<std::uint64_t, width * height> containerWithReducedData;

std::array<std::uint64_t, width * height> container1WithSplitData;
std::array<std::uint8_t, width * height>  container2WithSplitData;

struct CombinedData
{
    std::uint64_t first;
    std::uint8_t  second;
};

std::array<CombinedData, width * height> containerWithCombinedData;

void fillReducedData(const benchmark::State& state)
{
    // Variable is intentionally unused
    static_cast<void>(state);

    // Generate pseudo-random numbers (no seed, therefore always the same numbers)
    // NOLINTNEXTLINE
    auto engine = std::mt19937{};

    auto longsDistribution = std::uniform_int_distribution<std::uint64_t>{};

    for (int row = 0; row < height; ++row)
    {
        for (int column = 0; column < width; ++column)
        {
            const std::uint64_t number = longsDistribution(engine);

            containerWithReducedData.at(static_cast<unsigned int>(row * width + column)) = number;
        }
    }
}

std::uint64_t accessReducedData()
{
    std::uint64_t value = 0;

    for (int row = 0; row < height; ++row)
    {
        for (int column = 0; column < width; ++column)
        {
            value += containerWithReducedData.at(static_cast<unsigned int>(row * width + column));
        }
    }

    return value;
}

static void BM_AccessReducedData(benchmark::State& state)
{
    // Perform setup here
    for (auto _ : state)
    {
        // Variable is intentionally unused
        static_cast<void>(_);

        // This code gets timed
        benchmark::DoNotOptimize(accessReducedData());
    }
}

BENCHMARK(BM_AccessReducedData)->Setup(fillReducedData);

void fillSplitData(const benchmark::State& state)
{
    // Variable is intentionally unused
    static_cast<void>(state);

    // Generate pseudo-random numbers (no seed, therefore always the same numbers)
    // NOLINTNEXTLINE
    auto engine = std::mt19937{};

    auto longsDistribution = std::uniform_int_distribution<std::uint64_t>{};
    auto bytesDistribution = std::uniform_int_distribution<std::uint8_t>{};

    for (int row = 0; row < height; ++row)
    {
        for (int column = 0; column < width; ++column)
        {
            const std::uint64_t number                                                  = longsDistribution(engine);
            container1WithSplitData.at(static_cast<unsigned int>(row * width + column)) = number;

            const std::uint8_t additionalNumber                                         = bytesDistribution(engine);
            container2WithSplitData.at(static_cast<unsigned int>(row * width + column)) = additionalNumber;
        }
    }
}

std::uint64_t accessSplitData()
{
    std::uint64_t value = 0;

    for (int row = 0; row < height; ++row)
    {
        for (int column = 0; column < width; ++column)
        {
            value += container1WithSplitData.at(static_cast<unsigned int>(row * width + column));
            value += container2WithSplitData.at(static_cast<unsigned int>(row * width + column));
        }
    }

    return value;
}

static void BM_AccessSplitData(benchmark::State& state)
{
    // Perform setup here
    for (auto _ : state)
    {
        // Variable is intentionally unused
        static_cast<void>(_);

        // This code gets timed
        benchmark::DoNotOptimize(accessSplitData());
    }
}

BENCHMARK(BM_AccessSplitData)->Setup(fillSplitData);

void fillCombinedData(const benchmark::State& state)
{
    // Variable is intentionally unused
    static_cast<void>(state);

    // Generate pseudo-random numbers (no seed, therefore always the same numbers)
    // NOLINTNEXTLINE
    auto engine = std::mt19937{};

    auto longsDistribution = std::uniform_int_distribution<std::uint64_t>{};
    auto bytesDistribution = std::uniform_int_distribution<std::uint8_t>{};

    for (int row = 0; row < height; ++row)
    {
        for (int column = 0; column < width; ++column)
        {
            const std::uint64_t number = longsDistribution(engine);
            containerWithCombinedData.at(static_cast<unsigned int>(row * width + column)).first = number;

            const std::uint8_t additionalNumber = bytesDistribution(engine);
            containerWithCombinedData.at(static_cast<unsigned int>(row * width + column)).second = additionalNumber;
        }
    }
}

std::uint64_t accessCombinedData()
{
    std::uint64_t value = 0;

    for (int row = 0; row < height; ++row)
    {
        for (int column = 0; column < width; ++column)
        {
            value += containerWithCombinedData.at(static_cast<unsigned int>(row * width + column)).first;
            value += containerWithCombinedData.at(static_cast<unsigned int>(row * width + column)).second;
        }
    }

    return value;
}

static void BM_AccessCombinedData(benchmark::State& state)
{
    // Perform setup here
    for (auto _ : state)
    {
        // Variable is intentionally unused
        static_cast<void>(_);

        // This code gets timed
        benchmark::DoNotOptimize(accessCombinedData());
    }
}

BENCHMARK(BM_AccessCombinedData)->Setup(fillCombinedData);

Live demo

And this is the result:

Run on (12 X 4104.01 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 0.33, 1.82, 1.06
----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
BM_AccessReducedData       55133 ns        55133 ns        12309
BM_AccessSplitData         64089 ns        64089 ns        10439
BM_AccessCombinedData     170470 ns       170470 ns         3827

I am not surprised by the long running time of BM_AccessCombinedData. There is additional effort (compared to "reduced data") to add the bytes. My interpretation is that the added byte no longer fits into the cache line, which makes the access much more expensive. (Might there be another effect as well?)

But why is accessing the separate containers ("split data") so fast? There the data is located at different positions in memory and is accessed in alternation. Shouldn't that be even slower? Yet it is almost three times faster than accessing the combined data! Isn't that surprising?

  • 3
    First of all, any kind of benchmarking should be done on optimized code, don't attempt to disable optimizations. Secondly, even for the combined data you access the array twice, when you could copy the structure (or possibly use pointers/references to the structure). Try copying the structure once, and using a reference to the structure. And thirdly, if you know you will not go out of bounds, use `operator[]` to access elements rather than `at`, because `at` will have bounds-checking which adds overhead. – Some programmer dude Jul 12 '22 at 09:14
  • Note that with [clang and on quick-bench](https://quick-bench.com/q/TVrDlEqWAvDdNlDOrJAjk4HIR_8) results are different. – Marek R Jul 12 '22 at 09:16
  • @Someprogrammerdude Introducing `const CombinedData& data = containerWithCombinedData.at(static_cast<unsigned int>(row * width + column));` and using that for access doesn't change anything. I guess that the compiler does something like that automatically. – Benjamin Bihler Jul 12 '22 at 09:18
  • [gcc 11.2](https://quick-bench.com/q/jvsHky9J5t7PpyHby-foIepvmcY) [gcc 9.4](https://quick-bench.com/q/GD55sfO2Qa2T6LU-fA4B8qSJ9Os). – Marek R Jul 12 '22 at 09:19
  • @MarekR Interesting. Unfortunately with clang 13 the "split data access" and "combined data access" seem to be similar, because now both are slow. :-( – Benjamin Bihler Jul 12 '22 at 09:23
  • 2
    Looks like the compiler still outsmarts you: https://godbolt.org/z/W65fMEWY3 (note lines 284-286 in the assembly). Writing correct performance tests is hard when the compiler is able to optimize away lots of stuff (everything is in one source file/library). Global state is the main problem here. – Marek R Jul 12 '22 at 09:26
  • 1
    @Someprogrammerdude I have checked it now, the assembly code is exactly the same. – Benjamin Bihler Jul 12 '22 at 09:32
  • @MarekR If I add `benchmark::DoNotOptimize(...)` around the access of the split containers, the benchmark takes nearly as long as the access of the combined container. I guess that this is what you have meant. But I don't understand it. What has the compiler optimized here? – Benjamin Bihler Jul 12 '22 at 09:48
  • None of the functions have any side-effects; they are, in essence, pure. So if the values they return aren't used, there's no need to actually call them and the compiler could just skip the call. I'm not exactly sure what `DoNotOptimize` is doing, but hopefully it will make sure that the calls happen even if the returned value isn't used. – Some programmer dude Jul 12 '22 at 11:35
  • @Someprogrammerdude This is the purpose of `DoNotOptimize`, yes. – Benjamin Bihler Jul 12 '22 at 11:37
  • 4
    The split version has about half the memory bandwidth of combined. (Note that `sizeof(CombinedData) == 16`, not `9`, because `alignof(uint64_t) == 8`). And combined might be defeating auto-vectorization; have to check the asm. The same function gets called on the same data repeatedly, only forcing the result to be generated, so it's also possible that compilers are inlining and hoisting some of the work. – Peter Cordes Jul 14 '22 at 07:11
  • (`DoNotOptimize(x)` is like `volatile int dummy = x; x = dummy`, but using inline asm to avoid actually having to store/reload to forget the value, just forcing the compiler to have the result in a register. But it does also include `"memory"` clobber, so that makes the compiler assume that your array might have been modified, so it can't reuse the same computation. See [Google Benchmark Frameworks DoNotOptimize](https://stackoverflow.com/q/66795357) for how it works: it doesn't mean "don't optimize the function call", it's still standard C++ so it can only work on the value you use it on.) – Peter Cordes Jul 14 '22 at 07:15
  • 2
    I looked at the asm on Quick-bench; it does auto-vectorize, but with a pretty dumb strategy for `Combined` that involves packing and masking, and unpacking again. Not sure how much overhead that's adding per element, or if it's all just memory bandwidth. It seems to be using 2 different pointers inside the combined loop (RAX and RDX), starting from 2 different absolute addresses. Ah, 8 bytes apart, so one is a pointer to the byte member). The strategy it uses for `Split` is not great, either, failing to use `psadbw` / `paddq` to accumulate the sum of 16 bytes. (Split loops might do better.) – Peter Cordes Jul 15 '22 at 17:17
  • Yeah (about Peter's comment). You can help the compiler help you by having a separate loop to add for each of the split data containers. This specific task of summing _separate_ containers of different sized integers is a great scenario for SIMD instructions. I would use `std::reduce(std::execution::unseq, x1.cbegin(), x1.cend(), std::uint64_t{0}, std::plus{}) + std::reduce(...)`. – starball Jul 18 '22 at 05:35

1 Answer


Preface: This answer was written only for the example/scenario you provided in your benchmark link: a summing reduction over interleaved vs non-interleaved collections of differently sized integers. Summing is an unsequenced operation. You can visit elements of the collections and add them to the accumulating result in any order. And whether you "combine" (via struct) or "split" (via separate arrays), the order of accumulation doesn't matter.

Note: It would help if you provided some information about what you already know about optimization techniques and what processors/memory are usually capable of. Your comments show you know about caching, but I have no idea what else you know, or what exactly you know about caching.

Terminology

This choice of "combined" vs. "split" has other well-known names: the "combined" layout is commonly called array of structures (AoS), and the "split" layout is commonly called structure of arrays (SoA).

This question fits into the area of concerns people have when talking about data-oriented design.

For the rest of this answer, I will stay consistent with your terminology.

Alignment, Padding, and Structs

Quoting from cppreference, the C++ language has this requirement:

Every complete object type has a property called alignment requirement, which is an integer value of type size_t representing the number of bytes between successive addresses at which objects of this type can be allocated. The valid alignment values are non-negative integral powers of two.

"Every complete object" includes instances of structs in memory. Reading on...

In order to satisfy alignment requirements of all members of a struct, padding may be inserted after some of its members.

One of its examples demonstrates:

// objects of struct X must be allocated at 4-byte boundaries
// because X.n must be allocated at 4-byte boundaries
// because int's alignment requirement is (usually) 4
struct X {
    int n;  // size: 4, alignment: 4
    char c; // size: 1, alignment: 1
    // three bytes padding
}; // size: 8, alignment: 4

This is what Peter Cordes mentioned in the comments. Because of this requirement of the C++ language, padding is inserted in your "combined" collection.
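
As a quick check (a sketch, not part of the original benchmark; the exact numbers assume a typical 64-bit target where alignof(std::uint64_t) == 8, as Peter Cordes noted), two static_asserts make the padding visible:

// CombinedData from the question:
//     struct CombinedData { std::uint64_t first; std::uint8_t second; };
// On such a target, 7 bytes of padding follow `second` so that the `first`
// member of the next array element stays 8-byte aligned.
static_assert(alignof(CombinedData) == 8, "alignment inherited from std::uint64_t");
static_assert(sizeof(CombinedData) == 16, "8 bytes + 1 byte + 7 bytes of padding");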

I am not sure how much of a detriment to cache performance the padding is here, since the sum only visits each element of the arrays once. It is more likely to matter in a scenario where elements are frequently revisited: compared with the split representation, the padding of the combined representation "wastes" bytes of the cache, and that wastage can have a significant impact on cache performance. But the degree to which this matters depends on the patterns of revisiting the data.

SIMD

wikipedia article

SIMD instructions are specialized CPU machine instructions that perform an operation on multiple pieces of data at once, such as summing a group of same-sized integers that are laid out next to each other in memory (which is exactly what can be done in the "split"-representation version of your scenario).

Compared to machine code that doesn't use SIMD, SIMD usage can provide a constant-factor improvement (the value of the constant factor depends on the SIMD instruction). For example, a SIMD instruction that adds 8 bytes at once should be roughly 8 times faster than a scalar loop (unrolled or not) doing the same thing.

Other keywords: vectorization, parallelized code.

Peter Cordes mentioned relevant examples (psadbw, paddq). Here's a list of Intel SSE instructions for arithmetic.
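
To make those instruction names concrete, here is a rough sketch of summing the byte container with psadbw and paddq (this is not from the original post and not necessarily what a compiler emits; it assumes SSE2 intrinsics and a size that is a multiple of 16):

#include <emmintrin.h> // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

std::uint64_t sumBytesWithPsadbw(const std::uint8_t* data, std::size_t size)
{
    __m128i accumulator = _mm_setzero_si128(); // two 64-bit partial sums

    for (std::size_t i = 0; i < size; i += 16)
    {
        const __m128i bytes = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));

        // psadbw against zero sums each group of 8 bytes into a 64-bit lane...
        const __m128i partialSums = _mm_sad_epu8(bytes, _mm_setzero_si128());

        // ...and paddq accumulates those lanes.
        accumulator = _mm_add_epi64(accumulator, partialSums);
    }

    // Reduce the two 64-bit lanes to a scalar at the end.
    std::uint64_t lanes[2];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), accumulator);
    return lanes[0] + lanes[1];
}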

As Peter mentioned, a degree of SIMD usage is still possible in the "combined" representation, but not as much as is possible with the "split" representation. It comes down to what the target machine architecture's instruction set provides. I don't think there's a dedicated SIMD instruction for your example's "combined" representation.
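
Peter Cordes sketches in the comments below what that limited SIMD usage could look like for the combined layout: load each 16-byte element, mask the padding to zero with pand, and accumulate both members with paddq. A rough illustration of that idea (again an assumption-laden sketch: SSE2, the 16-byte padded layout shown above, little-endian):

#include <emmintrin.h> // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

std::uint64_t sumCombinedWithPandPaddq(const CombinedData* data, std::size_t count)
{
    // Keep bytes 0-8 of each element (`first` plus `second`); zero the 7 padding bytes.
    const __m128i mask = _mm_set_epi64x(0xFF, -1);

    __m128i accumulator = _mm_setzero_si128(); // lane 0 sums `first`, lane 1 sums `second`

    for (std::size_t i = 0; i < count; ++i)
    {
        const __m128i element = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        accumulator = _mm_add_epi64(accumulator, _mm_and_si128(element, mask)); // pand + paddq
    }

    // Horizontal reduction of the two lanes to a scalar.
    std::uint64_t lanes[2];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), accumulator);
    return lanes[0] + lanes[1];
}

Even then, the padding bytes of every element are still loaded from memory, so the combined layout still costs memory bandwidth compared with the split one.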

The Code

For the "split" representation, I would do something like:

// ...
#include <numeric>    // for `std::reduce`
#include <execution>  // for `std::execution`
#include <functional> // for `std::plus`

std::uint64_t accessSplitData()
{
    return std::reduce(std::execution::unseq, container1WithSplitData.cbegin(), container1WithSplitData.cend(), std::uint64_t{0}, std::plus{})
         + std::reduce(std::execution::unseq, container2WithSplitData.cbegin(), container2WithSplitData.cend(), std::uint64_t{0}, std::plus{});
}

// ...

It's a much more direct way to communicate (to readers of the code and to a compiler) an unsequenced sum of collections of integers.

But What about the Different Positions?

There the data is located at different positions in memory and there is alternating access to it. Shouldn't this be even slower?

As I showed in the code above, for your specific scenario there doesn't need to be alternating access at all. And even if the scenario changed to require alternating access, I don't think there would usually be much cache impact.

There is the possible problem of conflict misses if the corresponding entries of the split arrays map to the same cache sets. I don't know how likely this is to be encountered in practice, or whether there are techniques in C++ to prevent it; if anyone knows, please edit this answer. If a cache has N-way set associativity, and the access pattern over the "split" data only touches N or fewer arrays in the hot loop (i.e. doesn't access any other memory), I believe it should be impossible to run into this.
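
To make "same cache set" concrete, here is a back-of-the-envelope sketch (the geometry is an assumption: a 32 KiB L1 data cache with 64-byte lines and 8-way associativity, which Peter Cordes mentions in the comments is typical; your benchmark output reports 32 KiB L1 data caches):

#include <cstddef>
#include <cstdint>

// Assumed L1d geometry (not guaranteed; adjust for your CPU).
constexpr std::size_t lineSize      = 64;
constexpr std::size_t associativity = 8;
constexpr std::size_t cacheSize     = 32 * 1024;
constexpr std::size_t numSets       = cacheSize / (lineSize * associativity); // 64 sets

// Two addresses can only conflict if they map to the same set index;
// each set can hold `associativity` lines before evictions start.
constexpr std::size_t cacheSetIndex(std::uintptr_t address)
{
    return (address / lineSize) % numSets;
}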


Misc Notes

I would recommend that you keep your benchmark link in your question unchanged, and if you want to update it, add a new link, so people viewing the discussion will be able to see older versions being referred to.

Out of curiosity, is there a reason why you're not using a newer compiler version, such as GCC 11, for the benchmark?

I highly recommend the usage I showed of std::reduce. It's a widely recommended practice to use a dedicated C++ standard algorithm instead of a raw loop where one is suited for the task. See the reasons cited in the CppCoreGuidelines link. The code may be long (and in that sense, ugly), but it clearly conveys the intent to perform a sum where the reduction operator (plus) is unsequenced.

Your question is specifically about speed, but it's noteworthy that in C++, the choice of structure-of-arrays vs. array-of-structures can also be important where space costs matter, precisely because of alignment and padding.

There are more considerations in choosing structure-of-arrays vs. array-of-structures that I haven't listed. Memory access patterns are the main consideration for performance, but readability and simplicity matter as well; you can alleviate problems by building good abstractions, but there's still a limit to that, and the abstraction itself has maintenance, readability, and simplicity costs.

You may also find this video by Matt Godbolt on "Memory and Caches" useful.

  • The reason for using gcc 11 is that we have a given project compiler version. Your answer helps me a lot. My main lack of knowledge has probably been regarding set-associative cache. Thank you. – Benjamin Bihler Jul 19 '22 at 09:29
  • An efficient SIMD sum is *possible* for the interleaved version, compilers just fail to do that. e.g. load a 16-byte struct into a 16-byte vector register, `pand` (to mask the padding to 0) / `paddq` (to accumulate the 64-bit member and the 8-extended-to-64 member into a vector of 2x uint64_t accumulators). Then horizontal sum (reduce) to scalar at the end. Compilers do much worse, perhaps because they don't consider dealing with vectors of non-uniform element types. IIRC, gcc and clang were both pretty messy, with tons of shuffling; IDK if they'd be any better than scalar. – Peter Cordes Jul 20 '22 at 02:59
  • Of course, that efficient-as-possible SIMD sum still wastes 7/16ths of your memory/cache bandwidth, unavoidably because of the data layout. Even with split arrays, compilers were still bad at summing bytes, *not* using `psadbw`, instead unpacking `uint8_t` to `uint64_t` elements for `paddq`. Not doing any reducing with narrower types that still couldn't overflow. If you're bottlenecked on memory bandwidth, that's fine, but if data is hot in L1d cache, that's costing a lot of instructions from missing that `psadbw` peephole and failing to even use `paddw` and so on to combine as it widens. – Peter Cordes Jul 20 '22 at 03:00
  • Re: conflict misses: yeah, basically all modern CPUs (and *all* modern x86) have set-associative caches. Typically 8-way associative L1d, although Skylake L2 is only 4-way. (But it's larger so aliasing the same set is rarer, and the element size is different so they won't alias for many iterations). It's generally fine to have at least 4 streams of memory access (read or write) in a loop, more in more recent CPUs, when they're not aliasing in cache. (HW prefetchers can track multiple streams, like 1 per page in L2). [For-loop efficiency: merging loops](https://stackoverflow.com/q/51021262) – Peter Cordes Jul 20 '22 at 03:27
  • Sure, let's clean comments. It's worth at least summarizing the point that GCC and clang do a poor job for summing an array of `uint8_t` into a `uint64_t sum`. They should be using `psadbw` (against 0) / `paddq` inside the loop to horizontally sum chunks of 8 bytes into a `uint64_t` : 2 instructions per SIMD vector of 8-bit data, not unpacking each byte to 64-bit on its own https://godbolt.org/z/K6eabvbzP, costing more instructions per 1/8th of a SIMD vector of 8-bit data. This is a big missed optimization (todo: report it: https://github.com/llvm/llvm-project/issues / https://gcc.gnu.org/) – Peter Cordes Sep 28 '22 at 06:03
  • (`std::execution::unseq` in your https://godbolt.org/z/1qa5z1vWz example doesn't help; GCC already knows unsigned integers are commutative, unlike float or double, and sometimes GCC's missed optimizations for signed integers. GCC's SSE2 vectorization strategy looks insane, involving some spill/reload because it uses too many registers.) – Peter Cordes Sep 28 '22 at 06:08