
I was reading a blog post about the 64-bit edition of Firefox on hacks.mozilla.org.

The author states:

For asm.js code, the increased address space also lets us use hardware memory protection to safely remove bounds checks from asm.js heap accesses. The gains are pretty dramatic: 8%-17% on the asmjs-apps-*-throughput tests as reported on arewefastyet.com.

I was trying to understand how 64-bit hardware provides automatic bounds checks (assuming the compiler does them with hardware support) for C/C++. I could not find any answers on SO. I found one technical paper on this subject, but I am not able to grasp how this is done.

Can someone explain how 64-bit hardware aids in bounds checking?

Peter Cordes (edited) · Raghupathy (asked)
  • What in the paper is unclear to you? Do you understand how virtual memory mapping works? – Sneftel Apr 10 '15 at 15:51
  • @Sneftel, I understood they are using 64-bit's huge virtual page table to do this. Will go through the paper again to wrap my mind around this. – Raghupathy Apr 10 '15 at 16:09

2 Answers


Most modern CPUs implement virtual addressing/virtual memory - when a program references a particular address, that address is virtual; the mapping to a physical page, if any, is implemented by the CPU's MMU (memory management unit). The CPU translates every virtual address to a physical address by looking it up in the page table the OS set up for the current process. These lookups are cached by the TLB, so most of the time there's no extra delay. (In some non-x86 CPU designs, TLB misses are handled in software by the OS.)

So my program accesses address 0x8050, which is in virtual page 8 (assuming the standard 4096 byte (0x1000) page size). The CPU sees that virtual page 8 is mapped to physical page 200, and so performs a read at physical address 200 * 4096 + 0x50 == 0xC8050.
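That arithmetic is easy to see in a few lines of C. This is a minimal sketch assuming the standard 4096-byte page size; the physical page number is invented for the example, since real translation happens in the MMU, not in user code:

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096u   /* assumed 4 KiB page size */

    int main(void) {
        uintptr_t vaddr  = 0x8050;
        uintptr_t vpage  = vaddr / PAGE_SIZE;   /* virtual page number: 8 */
        uintptr_t offset = vaddr % PAGE_SIZE;   /* offset within the page: 0x50 */

        /* Pretend the page table maps virtual page 8 -> physical page 200. */
        uintptr_t ppage = 200;
        uintptr_t paddr = ppage * PAGE_SIZE + offset;

        printf("vpage=%ju offset=0x%jx paddr=0x%jx\n",
               (uintmax_t)vpage, (uintmax_t)offset, (uintmax_t)paddr);
        return 0;   /* prints paddr=0xc8050, matching the arithmetic above */
    }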

What happens when the CPU does not have a TLB mapping for that virtual address? Such a thing occurs frequently because the TLB is of limited size. The answer is that the CPU generates a page fault, which is handled by the OS.

Several outcomes can occur as a result of a page fault:

  • One, the OS can say "oh, well it just wasn't in the TLB because I couldn't fit it". The OS evicts an entry from the TLB and stuffs in the new entry using the process's page table map, and then lets the process keep running. This happens thousands of times per second on moderately loaded machines. (On CPUs with hardware TLB miss handling, like x86, this case is handled in hardware, and is not even a "minor" page fault.)
  • Two, the OS can say "oh, well that virtual page isn't mapped right now because the physical page it was using was swapped to disk because I ran out of memory". The OS suspends the process, finds some memory to use (perhaps by swapping out some other virtual mapping), queues a disk read for the requested physical memory, and when the disk read completes, resumes the process with the freshly filled page table mapping. (This is a "major" page fault.)
  • Three, the process is trying to access memory for which no mapping exists - it's reading memory it shouldn't be. This is commonly called a segmentation fault.

The relevant case is number 3. When a segfault happens, the default behavior of the operating system is to abort the process and do things like write out a core file. However, a process is allowed to trap its own segfaults and attempt to handle them, perhaps even without stopping. This is where things get interesting.
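As an aside, here is a minimal POSIX sketch of a process trapping its own segfault with sigaction and recovering via siglongjmp. A real runtime does something far more elaborate at this point, but the mechanism is the same:

    #define _POSIX_C_SOURCE 200809L
    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>

    static sigjmp_buf recover;
    static void * volatile fault_addr;

    static void on_segv(int sig, siginfo_t *info, void *ctx) {
        (void)sig; (void)ctx;
        fault_addr = info->si_addr;   /* the address the bad access touched */
        siglongjmp(recover, 1);       /* jump back to main instead of dying */
    }

    int main(void) {
        struct sigaction sa = {0};
        sigemptyset(&sa.sa_mask);
        sa.sa_sigaction = on_segv;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        if (sigsetjmp(recover, 1) == 0) {
            volatile char *bad = (volatile char *)8;   /* unmapped address */
            *bad = 42;                                 /* faults */
        } else {
            printf("recovered from fault at %p\n", fault_addr);
            /* A JIT would map fault_addr back to the offending array
             * access and raise an index-out-of-bounds exception here. */
        }
        return 0;
    }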

We can use this to our advantage to perform 'hardware accelerated' index checks, but there are a few stumbling blocks to clear along the way.

First, the general idea: every array gets its own region of virtual memory, with all of the pages that contain the array data mapped as usual. On either side of the real array data, we create virtual page mappings that are unreadable and unwritable, so any attempt to access outside the array generates a page fault. The compiler installs its own fault handler when it builds the program; that handler catches the fault and turns it into an index-out-of-bounds exception.
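A hedged sketch of that layout using POSIX mmap/mprotect (sizes illustrative, error handling trimmed):

    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        long page = sysconf(_SC_PAGESIZE);        /* usually 4096 */
        size_t data  = 4 * (size_t)page;          /* pages holding array data */
        size_t total = data + 2 * (size_t)page;   /* + one fence page each side */

        /* Reserve everything inaccessible, then open up only the middle. */
        char *base = mmap(NULL, total, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) return 1;

        char *array = base + page;                /* first usable byte */
        mprotect(array, data, PROT_READ | PROT_WRITE);

        array[0] = 'x';       /* fine: inside the mapped region */
        /* array[-1]   = 'x';   would hit the leading fence  -> SIGSEGV */
        /* array[data] = 'x';   would hit the trailing fence -> SIGSEGV */

        munmap(base, total);
        return 0;
    }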

Stumbling block number one is that we can only mark whole pages as readable or not. Array sizes may not be an even multiple of the page size, so we have a problem: we can't put fences exactly before the start and after the end of the array. The best we can do is leave a small gap between the array and the nearest 'fence' page, either before the array's beginning or after its end.

How do they get around this? Well, in Java's case, it's not easy to compile code that performs negative indexing; and if it does, it doesn't matter anyway because the negative index is treated like it's unsigned, which puts the index far ahead of the beginning of the array, which means that it's very likely to hit unmapped memory and will cause a fault anyway.
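The unsigned reinterpretation is just two's-complement wraparound, visible in two lines of C:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        int32_t  i = -1;
        uint32_t u = (uint32_t)i;   /* two's-complement wraparound */
        printf("%u\n", u);          /* 4294967295: ~4 GiB past the array base */
        return 0;
    }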

So what they do is to align the array so that the end of the array butts up right against the end of a page, like so ('-' means unmapped, '+' means mapped):

-----------++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------
|  Page 1  |  Page 2  |  Page 3  |  Page 4  |  Page 5  |  Page 6  |  Page 7  | ...
                 |----------------array---------------------------|

Now, if the index is past the end of the array, it'll hit page 7, which is unmapped, which will cause a page fault, which will turn into an index-out-of-bounds exception. If the index is before the beginning of the array (that is, negative), then because it's treated as an unsigned value it becomes very large and positive, putting us far past page 7, again causing an unmapped memory read, again causing a page fault, which again turns into an index-out-of-bounds exception.
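A sketch of that placement in C; alloc_at_page_end is a hypothetical helper written for illustration, not anything Firefox actually ships:

    #include <sys/mman.h>
    #include <unistd.h>
    #include <stddef.h>

    /* Map just enough pages for nbytes of data plus one PROT_NONE fence
     * page, and return a pointer positioned so the data ends exactly at
     * the fence. A sketch only; a real runtime would also remember `base`
     * so the region can be unmapped later. */
    static void *alloc_at_page_end(size_t nbytes) {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t data = (nbytes + page - 1) / page * page; /* round up to pages */

        char *base = mmap(NULL, data + page, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) return NULL;
        mprotect(base, data, PROT_READ | PROT_WRITE);    /* all but the fence */

        return base + data - nbytes;  /* array butts up against the fence page */
    }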

Stumbling block number two is that we really should leave a lot of unmapped virtual memory past the end of the array before mapping the next object. Otherwise, an index that was out of bounds by a very large margin might land on some other valid page, fail to cause an index-out-of-bounds exception, and instead read or write arbitrary memory.

In order to solve this, we just use huge amounts of virtual memory: we put each array into its own 4 GiB region of address space, of which only the first N pages are actually mapped. We can do this because we're only spending address space here, not actual physical memory. A 64-bit process has around 4 billion such 4 GiB regions, so we have plenty of address space to work with before we run out. On a 32-bit CPU or process, we have very little address space to play with, so this technique isn't feasible. As it is, many 32-bit programs today run out of virtual address space just trying to access real memory, never mind trying to map empty 'fence' pages into that space to use as 'hardware accelerated' index range checks.
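To show how cheap such a reservation is, here's a rough Linux-specific sketch for a 64-bit process; MAP_NORESERVE asks for pure address space with no swap backing, and the page counts are made up for illustration:

    #include <sys/mman.h>
    #include <stdio.h>

    int main(void) {
        size_t region = (size_t)4 << 30;   /* 4 GiB of address space */
        size_t usable = 16 * 4096;         /* only the first few pages hold data */

        void *base = mmap(NULL, region, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        mprotect(base, usable, PROT_READ | PROT_WRITE);  /* commit the real data */
        printf("reserved 4 GiB at %p, %zu bytes usable\n", base, usable);

        /* Any access in [base+usable, base+4GiB) faults -> free bounds check. */
        return 0;
    }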

antiduh
  • Nice explanation +1 - except for "Most modern CPUs implement virtual addressing/virtual memory". Most (billions of) processors sold in 2014/5 are relatively small, simple embedded ones (most are 32- or even 16-bit), and certainly at _least_ hundreds of millions of those do not use virtual addressing. C is very popular there. But I would agree with "Most 64-bit CPUs implement ...". – chux - Reinstate Monica Apr 10 '15 at 20:24
  • 1
    @Chux, You got me there, but I could take an entire paragraph to try to define the set of processors we're talking about.. "modern cpus that are 32-bit or 64-bit for desktop, laptop, mobile, server platforms". Even then you could poke holes in that language. The point is you've got to take the context of the conversation - CPUs that firefox will run on. – antiduh Apr 10 '15 at 21:49
  • Nice explanation, covered a few non-obvious details like how you align the array within a page. But TLB misses don't run kernel code. The hardware walks the page table to find the entry for that page. The TLB is a cache for the page tables. The OS only has to get involved when the page isn't present in the page table (or present without the needed permission, e.g. write.) – Peter Cordes Jul 14 '15 at 02:52
  • So apparently some CPU architectures (e.g. MIPS) do have software TLB-miss handling, like @antiduh described. I edited the post anyway to be more correct, but I might have made it unnecessarily long or more confusing. I did add some Wikipedia links, and corrected the page size in the example to the standard 4 KiB. – Peter Cordes Jul 14 '15 at 03:25
  • This sounds like a recipe for TLB thrashing. – TLW Aug 07 '23 at 01:20
  • @TLW - why would it thrash the TLB? The TLB would only map pages that have real content in them, in which case, this scheme upsets the TLB no more than not using this scheme. The array is still linear in memory, so its colocation properties still hold. You don't get TLB thrashing when accessing unmapped memory. So what's there to thrash? The data speaks for itself - this scheme is vastly quicker than inserting bounds checks. – antiduh Aug 07 '23 at 13:23
  • Each array consumes (at least) one TLB entry in this scheme (maybe even 2 depending on prefetching and if said TLB caches invalid entries), whereas "normally" you can have many arrays fitting within a single TLB entry. (This depends on the size of the arrays compared to page size, of course.) – TLW Aug 07 '23 at 19:34
  • It does not surprise me that this scheme performs significantly better on microbenchmarks; I would be concerned about larger applications no longer fitting in the TLB. – TLW Aug 07 '23 at 19:35
  • Is this optimization applied to strings / char arrays? – TLW Aug 07 '23 at 19:38

The technique they are using is similar to the Windows pageheap debugging mode, except that where pageheap sticks each VirtualAlloc() in its own virtual-memory page, this system sticks each array (static or stack-based) in its own virtual-memory page. More precisely, it places the allocation at the end of the page, because running off the end of an array is far more common than trying to access memory before the beginning of it. It then places an inaccessible "guard page" after the allocation's page - or, in their case, a sizable number of pages.

With that, bounds checks are not an issue, because an out-of-bounds access will trigger an access violation (SIGSEGV) instead of corrupting memory. This wasn't possible on earlier hardware simply because a 32-bit machine only had 1M pages to play with, and that wasn't enough to handle a non-toy application.
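For comparison, here is roughly what that pageheap-style placement looks like with the Win32 API; this is a sketch of the idea, not pageheap's actual implementation, and the 100-byte object size is invented:

    #include <windows.h>

    int main(void) {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        DWORD page = si.dwPageSize;   /* typically 4096 */

        /* One committed data page followed by one PAGE_NOACCESS guard page. */
        char *base = VirtualAlloc(NULL, 2 * page,
                                  MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        if (!base) return 1;

        DWORD old;
        VirtualProtect(base + page, page, PAGE_NOACCESS, &old);

        size_t nbytes = 100;
        char *obj = base + page - nbytes;  /* allocation ends at the guard page */
        obj[0] = 'x';                      /* fine */
        /* obj[nbytes] = 'x';  raises an access violation immediately */

        VirtualFree(base, 0, MEM_RELEASE);
        return 0;
    }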

LThode
  • Doesn't that use a lot more memory? Assuming they are using 4K pages, for small arrays that are less than 4K, that will use a lot more memory. If they are using 2M pages or 1G pages, then it's really wasteful. – Mark Lakata Apr 10 '15 at 16:01
  • @MarkLakata In the paper that I had cited, they are tackling this issue of sparse memory usage. – Raghupathy Apr 10 '15 at 16:10
  • @MarkLakata -- it uses a large swathe of *virtual address space* -- of course, only the physical storage actually necessary to store things is consumed, as guard/trap pages don't need to be backed by anything at all. – LThode Apr 10 '15 at 16:14
  • 1
    But virtual memory/physical memory mapping is done in units of page size (4K by default). You have to map the entire virtual memory page to an entire physical memory page. So that means a small array of length 32 bytes (for example) will now take 4096 bytes. The authors acknowledge that there is a hit also to TLB and cache performance, but I guess this is acceptable in their benchmarks, because all of their arrays are much larger than 4K. – Mark Lakata Apr 10 '15 at 18:04