
I'm writing and running software on versions of Windows and Linux from roughly the last five years, on machines of similar vintage.

I believe the running processes will never get close to the amount of memory we have (we're typically using like 3GB out of 32GB) but I'm certain we'll be filling up the cache to overflowing.

I have a data structure modification which makes the data structure run a bit slower (that is, do quite a bit more pointer math) but reduces both the cache line count and page count (working set) of the application.

My understanding is that if processes' memory is being paged out to disk, reducing the page count and thus disk I/O is a huge win in terms of wall clock time and latency (e.g. to respond to network events, user events, timers, etc.) However with memory so cheap nowadays, I don't think processes using my data structure will ever page out of memory. So is there still an advantage in improving this? For example if I'm using 10,000 cache lines, does it matter whether they're spread over 157 4k pages (=643k) or 10,000 4k pages (=40MB)? Subquestion: even if you're nowhere near filling up RAM, are there cases a modern OS will page a running process out anyways? (I think Linux of the 90s might have done so to increase file caching, for instance.)

In contrast, I can also reduce cache lines, and I suspect this will probably reduce wall clock time quite a bit. Just to be clear on what I'm counting: I count the number of 64-byte-aligned regions of memory in which I touch at least one byte as the total cache lines used by this structure. Example: if hypothetically I have 40,000 16-byte structures, a computer whose processes are constantly referring to more memory than it has L1 cache will probably run everything faster if those structures are packed into 10,000 64-byte cache lines instead of each straddling two cache lines apiece for 80,000 cache lines, right?

asked by Swiss Frank (edited by Hadi Brais)

1 Answer


The general recommendation is to minimize the number of cache lines and virtual pages that exhibit high temporal locality on the same core during the same time interval of the lifetime of a given application.

When the total number of needed physical pages is about to exceed the capacity of main memory, the OS will attempt to free up some physical memory by paging out some of the resident pages to secondary storage. This situation may result in major page faults, which can significantly impact performance. Even if you know with certainty that the system will not reach that point, there are other performance issues that may occur when unnecessarily using more virtual pages (i.e., there is wasted space in the used pages).

On Intel and AMD processors, page table entries and other paging structures are cached in hardware caches in the memory management unit to efficiently translate virtual addresses to physical addresses. These include the L1 and L2 TLBs. On an L2 TLB miss, a hardware component called the page walker is engaged to perform the required address translation. More pages means more misses. On pre-Broadwell (1) and pre-Zen microarchitectures, there can only be one outstanding page walk at any time; on later microarchitectures, there can be up to two. In addition, on Intel Ivy Bridge and later, it may be more difficult for the TLB prefetcher to keep up with the misses.

On Intel and AMD processors, the L1D and L2 caches are designed so that all cache lines within the same 4K page are guaranteed to map to different cache sets. So if all of the cache lines of a page are used, in contrast to, for example, spreading the cache lines across 10 different pages, conflict misses in these cache levels may be reduced. That said, on all AMD processors and on Intel pre-Haswell microarchitectures, bank conflicts are more likely to occur when accesses are more spread out across cache sets.

On Intel and AMD processors, hardware data prefetchers don't work across 4K boundaries. An access pattern that could be detected by one or more prefetchers but has the accesses spread out across many pages would benefit less from hardware prefetching.

On Windows/Intel, the accessed bits of the page table entries of all present pages are reset every second by the working set manager. So when accesses are unnecessarily spread out in the virtual address space, the number of page walks that require microcode assists (to set the accessed bit) per memory access may become larger.

The same applies to minor page faults as well (on Intel and AMD). Both Windows and Linux use demand paging, i.e., an allocated virtual page is only mapped to a physical page on demand. When a virtual page that is not yet mapped to a physical page is accessed, a minor page fault occurs. Just like with the accessed bit, the number of minor page faults per access may be larger.

On Intel processors, with more pages accessed, it becomes more likely for nearby accesses on the same logical core to 4K-alias. For more information on 4K aliasing, see: L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes.

There are probably other potential issues.

Subquestion: even if you're nowhere near filling up RAM, are there cases a modern OS will page a running process out anyways? (I think Linux of the 90s might have done so to increase file caching, for instance.)

On both Windows and Linux, there is a maximum working set size limit on each process. On Linux, this is called RLIMIT_RSS and is not enforced. On Windows, this limit can be either soft (the default) or hard. The limit is only enforced if it is hard (which can be specified by calling the SetProcessWorkingSetSizeEx function and passing the QUOTA_LIMITS_HARDWS_MAX_ENABLE flag). When a process reaches its hard working set limit, further memory allocation requests will be satisfied by paging out some of its pages to the page file, even if free physical memory is available.

I don't know about Linux of the 90s.


Footnotes:

(1) The Intel optimization manual mentions in Section 2.2.3 that Skylake can do two page walks in parallel compared to one in Haswell and earlier microarchitectures. To my knowledge, the manual does not clearly mention whether Broadwell can do one or two page walks in parallel. However, according to slide 10 of these Intel slides (entitled "Intel Xeon Processor D: The First Xeon processor optimized for dense solutions"), Xeon Broadwell supports two page walks. I think this also applies to all Broadwell models.

answered by Hadi Brais
Comments:
  • Linux will still page out infrequently-used pages to make more room for cache. This is good for throughput, bad for worst-case response latency for interactive use. The normal suggestion is to set sysctl `vm.swappiness = 5` instead of the default 60 on desktops/laptops. – Peter Cordes Jun 17 '19 at 15:58
  • Hadi, bravo, incredible depth there. I don't know half the buzz-phrases but you've given me the vocabulary I need to google further. Before, I didn't even know WHAT to google. – Swiss Frank Jun 18 '19 at 08:54
  • I love the answer, but can we start it off with a 1-2 line summary? If I understand what you're saying (and I'm not 100% confident), it sounds like you're saying that, generally, YES, everything else being equal, on a machine with far more physical memory than processes need, and no file cache paging things out, reducing page count will improve performance even should the total cache lines accessed be the same. My last question is whether you can quantify this even vaguely? If we're talking about, say, 60 MB apps having a modern server to themselves, might reducing the page count show improvement? – Swiss Frank Jun 18 '19 at 09:00
  • @SwissFrank Sorry for being lazy and not adding links for more information. I think the first sentence in the answer is a good summary. Your understanding looks good. All of the issues that I have pointed out have performance counters that can be used to detect them. You can find a lot of information here on Stack Overflow on this stuff. – Hadi Brais Jun 18 '19 at 09:31
  • No, no, not lazy by any means; probably the best answer I've ever gotten on Stack Exchange. No links needed since you've supplied so many keywords. My only point is that it might be too detailed for some to understand, so starting off with a summary is a good idea. And while you've given one, it is at a very technical and precise level. If I add a sentence or two as an edit, please only take it as a credit that your excellent answer should be reaching the widest audience possible. – Swiss Frank Jun 18 '19 at 13:21
  • @SwissFrank Sure, you can go ahead and make edits that may improve the readability of the answer. – Hadi Brais Jun 18 '19 at 14:11