Modern desktop CPUs have hierarchical memory: access to main memory goes through a series of caches (e.g. the L2 and L1 caches), each smaller and faster than the last. Data that has never been seen before is loaded into the cache first and from there into the CPU registers, and the result is stored back into the cache. The cache is only written back to main memory later on.
If multiple operations all touch data that is already in the cache, then only a single write back to memory is required at the end of the whole set of operations, which can be dramatically faster than accessing main memory for each individual operation.
Moreover, memory is transferred to and from the cache in large blocks, called cache lines. Typical cache line sizes are 64 bytes or 128 bytes.
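If you want to check the line size on your own machine, glibc exposes it through `sysconf` (a Linux-specific sketch; other platforms have their own APIs, and C++17 additionally offers `std::hardware_destructive_interference_size` in `<new>` as a compile-time hint, though not every standard library defines it yet):

```cpp
#include <cstdio>
#include <unistd.h>

int main() {
    // glibc-specific: reports the L1 data cache line size,
    // or -1/0 if the value is unknown on this system.
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    std::printf("L1 data cache line size: %ld bytes\n", line);
}
```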
So when your class is just `{ int i; }`, accessing the first element of the array already brings a number of subsequent elements into the cache, and several operations can be performed with just a single fetch from main memory. When the class is large (on the order of a cache line or bigger), one cache line holds the `i` member of only a single array element, and so you need to go out to main memory for every element.
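Here is a minimal sketch of that effect (the struct names and sizes are my own illustration, not from the question; compile with optimizations, e.g. `-O2`, and expect the exact numbers to vary with your hardware):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

struct Small { int i; };               // 4 bytes: ~16 elements per 64-byte line
struct Big   { int i; char pad[60]; }; // 64 bytes: one element per line

// Same logical work in both cases: sum the `i` member over the array.
template <class T>
long long sum_i(const std::vector<T>& v) {
    long long s = 0;
    for (const T& e : v) s += e.i;
    return s;
}

int main() {
    const int N = 1 << 22;  // large enough that the arrays don't fit in cache
    std::vector<Small> small(N, Small{1});
    std::vector<Big>   big(N, Big{1, {}});

    auto time = [](auto&& f) {
        auto t0 = std::chrono::steady_clock::now();
        long long s = f();
        auto t1 = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("sum=%lld  %lld ms\n", s, ms);
    };

    time([&] { return sum_i(small); }); // many sums per cache line fetched
    time([&] { return sum_i(big); });   // roughly one cache miss per element
}
```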
Modern processors try to predict which memory you will need next and prefetch it speculatively, but all the same, accessing main memory is orders of magnitude slower than hitting the cache, and so traversing the array with a large stride is significantly more expensive.
It is for this reason that it is important to consider access patterns when optimizing code (and data!) for performance. This is where the "array of structs" vs "struct of arrays" trade-off comes in. Or, as the general wisdom goes, "most of the time, performance problems are the result of a poor choice of data structures".
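To make that last point concrete, here is a sketch of the two layouts with an invented particle type (the names and layout are illustrative, not from any particular library):

```cpp
#include <cstdio>
#include <vector>

// Array of structs: the x, y, z of one particle sit next to each other.
struct ParticleAoS { float x, y, z; };

// Struct of arrays: all x values are contiguous, likewise y and z.
struct ParticlesSoA {
    std::vector<float> x, y, z;
};

// A pass over x in AoS form strides by sizeof(ParticleAoS); only a third
// of every fetched cache line is data this loop actually uses.
float sum_x_aos(const std::vector<ParticleAoS>& p) {
    float s = 0;
    for (const auto& e : p) s += e.x;
    return s;
}

// The same pass in SoA form reads densely: every byte of every fetched
// cache line contributes to the sum, and the loop vectorizes more easily.
float sum_x_soa(const ParticlesSoA& p) {
    float s = 0;
    for (float v : p.x) s += v;
    return s;
}

int main() {
    const int N = 1000000;
    std::vector<ParticleAoS> aos(N, {1.0f, 2.0f, 3.0f});
    ParticlesSoA soa{std::vector<float>(N, 1.0f),
                     std::vector<float>(N, 2.0f),
                     std::vector<float>(N, 3.0f)};
    std::printf("%f %f\n", sum_x_aos(aos), sum_x_soa(soa));
}
```

Which layout wins depends on the access pattern: if you usually touch all fields of one element together, AoS is fine; if hot loops sweep a single field across many elements, SoA keeps the cache lines full of useful data.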