Modern CPUs are best at sequential access. Iterating over an array in order, in either direction, gives the fastest access times because hardware prefetchers are designed to recognize such access patterns and pull the data into the L1 cache, ideally before you start reading it.
In other words, std::vector<> and std::array<> are the fastest containers for sequential iteration.
See the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", section 2.2.5.4 "Data Prefetching", for more details:
... The goal of the prefetchers is to automatically predict which data the program is about to consume. If this data is not close-by to the execution core or inner cache, the prefetchers bring it from the next levels of cache hierarchy
and memory. Prefetching has the following effects:
• Improves performance if data is arranged sequentially in the order used in the program.
• May cause slight performance degradation due to bandwidth issues, if access patterns are sparse instead of local.
• On rare occasions, if the algorithm's working set is tuned to occupy most of the cache and unneeded prefetches evict lines required by the program, hardware prefetcher may cause severe performance degradation due to cache capacity of L1.
Data Prefetch to L1 Data Cache
Data prefetching is triggered by load operations when the following conditions are met:
• Load is from writeback memory type.
• The prefetched data is within the same 4K byte page as the load instruction that triggered it.
• No fence is in progress in the pipeline.
• Not many other load misses are in progress.
• There is not a continuous stream of stores.