
I have been experimenting with "memory bandwidth". Let's set aside the fact that this term is not precisely defined and concentrate on the following code:

#include <chrono>
#include <iostream>
#include <string>
#include <vector>

int main() {
  auto size = std::vector<std::size_t>{
      1024 / sizeof(std::size_t),                  // 1 KB
      (1024 * 1024) / sizeof(std::size_t),         // 1 MB
      (1024 * 1024 * 10) / sizeof(std::size_t),    // 10 MB
      (1024 * 1024 * 100) / sizeof(std::size_t),   // 100 MB
      (1024 * 1024 * 1024) / sizeof(std::size_t),  // 1 GB
      (1024 * 1024 * 2047) / sizeof(std::size_t)}; // 2 GB
  auto size_name = std::vector<std::string>{
      "1 KB", "1 MB", "10 MB", "100 MB", "1 GB", "2 GB"};
  auto check = std::size_t{0};

  for (auto i = std::size_t{0}; i < size.size(); ++i) {
    auto v = new std::size_t[size[i]];

    auto date_start_cold = std::chrono::high_resolution_clock::now();
    for (auto k = std::size_t{0}; k < size[i]; ++k) {
      check += v[k];
    }
    auto date_end_cold = std::chrono::high_resolution_clock::now();
    auto time_cold = std::chrono::duration_cast<std::chrono::nanoseconds>(
        date_end_cold - date_start_cold).count();

    auto date_start_warm = std::chrono::high_resolution_clock::now();
    for (auto k = std::size_t{0}; k < size[i]; ++k) {
      check += v[k];
    }
    auto date_end_warm = std::chrono::high_resolution_clock::now();
    auto time_warm = std::chrono::duration_cast<std::chrono::nanoseconds>(
        date_end_warm - date_start_warm).count();
    std::cout << "Bandwidth for reading n = " << size_name[i] << ": "
        << (size[i] * sizeof(std::size_t)) / static_cast<double>(time_cold)
        << " GB/s (cold), "
        << (size[i] * sizeof(std::size_t)) / static_cast<double>(time_warm)
        << " GB/s (warm)" << std::endl;

    delete[] v;
  }

  std::cout << "Check: " << check << std::endl;
  std::cout << std::endl;

  for (auto i = std::size_t{0}; i < size.size(); ++i) {
    auto v = new std::size_t[size[i]];

    auto date_start_cold = std::chrono::high_resolution_clock::now();
    for (auto k = std::size_t{0}; k < size[i]; ++k) {
      v[k] = k;
    }
    auto date_end_cold = std::chrono::high_resolution_clock::now();
    auto time_cold = std::chrono::duration_cast<std::chrono::nanoseconds>(
        date_end_cold - date_start_cold).count();

    auto date_start_warm = std::chrono::high_resolution_clock::now();
    for (auto k = std::size_t{0}; k < size[i]; ++k) {
      v[k] = k;
    }
    auto date_end_warm = std::chrono::high_resolution_clock::now();
    auto time_warm = std::chrono::duration_cast<std::chrono::nanoseconds>(
        date_end_warm - date_start_warm).count();
    std::cout << "Bandwidth for writing n = " << size_name[i] << ": "
        << (size[i] * sizeof(std::size_t)) / static_cast<double>(time_cold)
        << " GB/s (cold), "
        << (size[i] * sizeof(std::size_t)) / static_cast<double>(time_warm)
        << " GB/s (warm)" << std::endl;

    check += v[size[i] - 1];
    delete[] v;
  }

  std::cout << "Check: " << check << std::endl;

  return 0;
}

My L3 cache is only 6 MB, so I did not expect the second pass over a 1 GB array to be faster. In my mental model of a cache, by the time the last element of a 1 GB array has been read, the first element has long since been evicted. I would therefore expect the cache to be "cold" again when the first element is read a second time. But here are the results I get on my MacBook Air:

Bandwidth for reading n = 1 KB: 2.37037 GB/s (cold), 6.2439 GB/s (warm)
Bandwidth for reading n = 1 MB: 1.53677 GB/s (cold), 16.1215 GB/s (warm)
Bandwidth for reading n = 10 MB: 1.9973 GB/s (cold), 8.19585 GB/s (warm)
Bandwidth for reading n = 100 MB: 2.7617 GB/s (cold), 13.675 GB/s (warm)
Bandwidth for reading n = 1 GB: 2.86934 GB/s (cold), 16.1397 GB/s (warm)
Bandwidth for reading n = 2 GB: 2.79094 GB/s (cold), 14.2438 GB/s (warm)

Bandwidth for writing n = 1 KB: 0.392789 GB/s (cold), 9.94175 GB/s (warm)
Bandwidth for writing n = 1 MB: 5.74285 GB/s (cold), 16.2936 GB/s (warm)
Bandwidth for writing n = 10 MB: 5.94669 GB/s (cold), 9.37043 GB/s (warm)
Bandwidth for writing n = 100 MB: 7.02979 GB/s (cold), 9.08043 GB/s (warm)
Bandwidth for writing n = 1 GB: 2.5859 GB/s (cold), 8.64279 GB/s (warm)
Bandwidth for writing n = 2 GB: 2.34798 GB/s (cold), 8.99719 GB/s (warm)

Does anyone have an explanation for this behaviour?

InsideLoop
  • Have you looked at the disassembly to make sure the compiler hasn't done some clever optimization that means the two loops aren't doing the same thing? – Adam Aug 08 '15 at 21:23
  • Are you reading from an uninitialized array? `new std::size_t[size[i]]` only default-initializes the array elements, and I see no initialization prior to the read in `v[k]`. – dyp Aug 08 '15 at 21:42
  • @dyp: I agree that the array is uninitialised. That's why the cache is "cold" regarding this array. Imagine that the vector has been initialised and then the cache completely thrashed by some other data, so the array has been evicted from the cache. You could also focus on the writing benchmark. By the way, `new std::size_t[size[i]]` does not do any initialization of the elements. It just allocates memory. – InsideLoop Aug 08 '15 at 21:45
  • I'm concerned because this is Undefined Behaviour, and if you're compiling with optimizations, the compiler can do all kinds of funny things with UB. Additionally, clang can optimize some accesses to heap memory, so I wouldn't be surprised if it drops some of the loops here. Also note that `new` typically only reserves memory pages, and does not commit them (IIRC); it is possible you're also measuring some memory management overhead. – dyp Aug 08 '15 at 21:48
  • (With "default initialization", I was referring to a term defined in the C++ Standard. Default initialization for fundamental types such as `size_t` means *no initialization is performed*. Admittedly, the name might not be very appropriate.) – dyp Aug 08 '15 at 21:49
  • @dyp: I am interested in your comment about "committing memory pages". What does it mean? Any reference? – InsideLoop Aug 08 '15 at 21:51
  • @dyp: As for the undefined behaviour, I agree with you that some compilers can do some nasty things. But the write part of my benchmark is properly defined. – InsideLoop Aug 08 '15 at 21:52
  • 1
    Please use `@username` in a comment to send `username` a notification about the comment. -- Ad "committing memory pages", my terminology is probably confused by the Windows SDK; some info on the topic I've tried to mention can be found in the Q/A [Why isn't malloc filling up memory?](http://stackoverflow.com/q/21577583/). – dyp Aug 08 '15 at 21:56
  • 3
    When you allocate memory, the OS will not assign physical pages to the virtual memory. This happens the first time you read/write those pages [this applies both for heap allocations and static allocations]. This is because it's common for memory allocations to be largely unused (for example the editor that allows a 4MB file to be loaded, but typically you load a few kilobytes of file), so not assigning physical memory to the allocation is a good saving in this case. the "committing memory pages" refers to the processing in the OS where physical memory pages are assigned to the virtual address. – Mats Petersson Aug 08 '15 at 22:00
  • @dyp, @Mats Petersson: That explains everything. I was not aware of that. It answers my question. – InsideLoop Aug 08 '15 at 22:05
  • There's a good article on memory allocation costs at https://randomascii.wordpress.com/2014/12/10/hidden-costs-of-memory-allocation/ – Adam Aug 08 '15 at 22:06
  • Well, replacing that `auto v = new std::size_t[size[i]];` with `auto volatile* v = new std::size_t[size[i]]();` produces much more reasonable results IMO. This attempts to prevent UB (by using value-initialization instead of default-init) and compiler optimizations (by using `volatile`). – dyp Aug 08 '15 at 22:06
  • That entirely depends on microarchitecture. Which CPU are you using? – Leeor Aug 09 '15 at 16:55
  • @Leeor: The question has been answered. The difference is due to "hidden" memory allocation that happen when we first touch the array. – InsideLoop Aug 09 '15 at 17:14

0 Answers