Modern desktop CPUs have hierarchical memory: access to main memory goes through a series of caches (e.g. the L2 and L1 caches), each smaller and faster than the last. Data that has never been seen before is loaded into the cache first and from there into the CPU registers, and the result is stored back into the cache. The cache is only written back to main memory later on.
If multiple operations all touch data that is already in the cache, then only a single write back to memory is required at the end of the whole set of operations, which can be dramatically faster than accessing main memory for each individual operation.
Moreover, memory is transferred to and from the cache in large blocks, called cache lines. Typical cache line sizes are 64 bytes or 128 bytes.
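If you want to check the line size on your own machine, glibc exposes it through `sysconf` (a Linux-specific sketch; other platforms have their own APIs, and C++17 additionally offers `std::hardware_destructive_interference_size` in `<new>` as a compile-time hint, though not every standard library defines it yet):

```cpp
#include <cstdio>
#include <unistd.h>

int main() {
    // glibc-specific: reports the L1 data cache line size,
    // or -1/0 if the value is unknown on this system.
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    std::printf("L1 data cache line size: %ld bytes\n", line);
}
```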
So when your class is just `{ int i; }`, accessing the first element of the array already brings a number of subsequent elements into the cache, and several operations can be performed with just a single fetch from main memory. When the class is large (on the order of a cache line or bigger), one cache line holds the `i` member of only a single array element, and so you need to go out to main memory for every element.
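Here is a minimal sketch of that effect (the struct names and sizes are my own illustration, not from the question; compile with optimizations, e.g. `-O2`, and expect the exact numbers to vary with your hardware):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

struct Small { int i; };               // 4 bytes: ~16 elements per 64-byte line
struct Big   { int i; char pad[60]; }; // 64 bytes: one element per line

// Same logical work in both cases: sum the `i` member over the array.
template <class T>
long long sum_i(const std::vector<T>& v) {
    long long s = 0;
    for (const T& e : v) s += e.i;
    return s;
}

int main() {
    const int N = 1 << 22;  // large enough that the arrays don't fit in cache
    std::vector<Small> small(N, Small{1});
    std::vector<Big>   big(N, Big{1, {}});

    auto time = [](auto&& f) {
        auto t0 = std::chrono::steady_clock::now();
        long long s = f();
        auto t1 = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("sum=%lld  %lld ms\n", s, ms);
    };

    time([&] { return sum_i(small); }); // many sums per cache line fetched
    time([&] { return sum_i(big); });   // roughly one cache miss per element
}
```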
Modern processors try to predict which memory you will need next and prefetch it speculatively, but all the same, accessing main memory is orders of magnitude slower than hitting the cache, and so traversing the array with a large stride is significantly more expensive.
It is for this reason that it is important to consider access patterns when optimizing code (and data!) for performance. This is where the "array of structs" vs "struct of arrays" trade-off comes in. Or, as the general wisdom goes, "most of the time, performance problems are the result of a poor choice of data structures".
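To make that last point concrete, here is a sketch of the two layouts with an invented particle type (the names and layout are illustrative, not from any particular library):

```cpp
#include <cstdio>
#include <vector>

// Array of structs: the x, y, z of one particle sit next to each other.
struct ParticleAoS { float x, y, z; };

// Struct of arrays: all x values are contiguous, likewise y and z.
struct ParticlesSoA {
    std::vector<float> x, y, z;
};

// A pass over x in AoS form strides by sizeof(ParticleAoS); only a third
// of every fetched cache line is data this loop actually uses.
float sum_x_aos(const std::vector<ParticleAoS>& p) {
    float s = 0;
    for (const auto& e : p) s += e.x;
    return s;
}

// The same pass in SoA form reads densely: every byte of every fetched
// cache line contributes to the sum, and the loop vectorizes more easily.
float sum_x_soa(const ParticlesSoA& p) {
    float s = 0;
    for (float v : p.x) s += v;
    return s;
}

int main() {
    const int N = 1000000;
    std::vector<ParticleAoS> aos(N, {1.0f, 2.0f, 3.0f});
    ParticlesSoA soa{std::vector<float>(N, 1.0f),
                     std::vector<float>(N, 2.0f),
                     std::vector<float>(N, 3.0f)};
    std::printf("%f %f\n", sum_x_aos(aos), sum_x_soa(soa));
}
```

Which layout wins depends on the access pattern: if you usually touch all fields of one element together, AoS is fine; if hot loops sweep a single field across many elements, SoA keeps the cache lines full of useful data.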