
I have a program compiled with gcc 11.2, which first allocates some RAM (8 GB) on the heap (using new), and later fills it with data read out in real time from an oscilloscope.

uint32_t* buffer = new uint32_t[0x80000000];
for(uint64_t i = 0; i < 0x80000000; ++i) buffer[i] = GetValueFromOscilloscope();

The problem I am facing is that the optimizer skips the allocation on the first line, and does it on the fly as I am traversing the loop. This slows down each iteration of the loop. Because it is important to be as efficient as possible during the loop, I have found a way to force the compiler to allocate the memory before entering the for loop, namely setting all the reserved values to zero:

uint32_t* buffer = new uint32_t[0x80000000]();

My question is: is there a less intrusive way of achieving the same effect without forcing the data to be zero in the first place (apart from switching off the optimization flags)? I just want to force the compiler to reserve the memory at the moment of declaration, but I do not care whether the reserved values are zero or not.

Thanks in advance!

EDIT1: The evidence I have that the optimizer is delaying the allocation is that gnome-system-monitor shows the RAM usage growing slowly while I traverse the loop, and it only reaches 8 GiB after the loop finishes. Whereas if I initialize all the values to zero, gnome-system-monitor shows a quick growth up to 8 GiB, and then the loop starts.

EDIT2: I am using Ubuntu 22.04.1 LTS

ferdymercury
  • What evidence do you have for "the optimizer skips the allocation on the first line, and does it on the fly as I am traversing the loop"? Please include it in the question – 463035818_is_not_an_ai Aug 31 '22 at 11:47
  • You are observing on-demand paging from your operating system. This has nothing to do, whatsoever, with C++. – Sam Varshavchik Aug 31 '22 at 11:47
  • Out of curiosity: What happens if you try and initialize element `0x80000000 - 1` before the loop? – paolo Aug 31 '22 at 11:47
  • You probably don't need to zero all the memory. Touching one byte per page (probably 4 kB) will likely be enough. But you probably need to be concerned with swapping and such as well if the timing is important. You might want to use some API of your operating system which may allow to specify options that reduce the cost due to page faults, e.g. `mmap`/`mlock` on Linux. If you don't need it to be portable, you should add your operating system to the question so an answer specific to that can be given. – user17732522 Aug 31 '22 at 11:54
  • And more generally you might need a real-time operating system if there are small time windows that you need to guarantee are kept. Operating systems such as Windows or Linux do not make any guarantees in what time window your code will be executed. The operating system's scheduler might suspend your loop to execute some other program/thread at any time, depending on the system's load. – user17732522 Aug 31 '22 at 12:07
  • I am using Ubuntu 22 and I use taskset --cpu-list 0-3 to ensure that the time window stays constant enough. Right now, it reads data from the oscilloscope in chunks of 512 MB, and it always takes 100 ms +/- 5 ms (sometimes more, sometimes less, but close), which is quite good for me, so I am not planning to move to a real-time OS. The only problem is that setting all RAM bytes to zero in the beginning takes 1 minute or so, which is slightly annoying. And if I do not set all bytes to zero in the beginning, then the RAM is allocated on the fly, and the readout takes 125 ms instead of 100 ms, which is bad – ferdymercury Aug 31 '22 at 12:48
  • paolo, thanks for the suggestion. I tried initializing only buffer[0x80000000 - 1], but still, the memory is only allocated during the loop. – ferdymercury Aug 31 '22 at 12:54
  • https://offlinemark.com/2020/10/14/demand-paging/ – artless noise Aug 31 '22 at 13:06
  • Thanks for the suggestion @user17732522. Indeed, using a for loop to initialize only one element every 4 kB works as well as setting all elements to zero in the constructor. In both cases, the memory is initialized before the for loop. It is weird however that I do not see much time difference between all set to zero, and only one element every 4 kB. For 64 GB, I see the allocation last 12.1 seconds when all are set to zero, vs 11.8 s when setting only one every 4 kB. Maybe there is a smarter alternative than just using a for loop over the pages. I'll also look into huge pages as suggested by Guillaume. – ferdymercury Aug 31 '22 at 13:18
  • @ferdymercury When your process is assigned a new page the kernel will zero the whole page anyway for security purposes. The second zeroing loop you are executing then probably has significant cache benefits. There must also be some overhead from the kernel's page allocation, but if I remember correctly that was significantly less than the cost of zeroing. If you compiled your kernel with debug symbols enabled, you can see that in output of e.g. `perf`. This zeroing can not usually be disabled because without it acquiring a new page would leak memory of previous processes. – user17732522 Aug 31 '22 at 15:13
  • @ferdymercury As mentioned in an answer, you can also try to map 1GB huge pages with `mmap`, assuming you have enough memory available to allocate them (this may need boot time configuration). This would reduce the page allocation overhead (but not the zeroing overhead). Of course you can also try to do the allocation as early as possible in your program and then just reuse it later to avoid all of this except at program startup. – user17732522 Aug 31 '22 at 15:22
  • @user17732522 Thanks. I will look into 1 GB hugepages, but yes, the problem will probably be the zeroing overhead. Sorry about the ambiguous 1 minute comment. Sometimes, for some measurements, I need to allocate 245 GiB of RAM. The zeroing overhead makes me lose close to one minute at program startup, as the DDR4-3200 RAM memory is zeroed at 5.5 GiB/s. (I guess only a kernel option disabling the zeroing would improve that value significantly.) – ferdymercury Aug 31 '22 at 15:29
  • @ferdymercury That is still significantly below what that memory should be able to do, so the page fault overhead itself might still be significant. – user17732522 Aug 31 '22 at 15:31
  • @ferdymercury By default the kernel build system will not allow you to disable the zeroing either. It requires the MMU support to be disabled, see https://github.com/torvalds/linux/blob/master/mm/Kconfig#L337, which only really makes sense on embedded. So it is probably not straight-forward. – user17732522 Aug 31 '22 at 15:39

2 Answers


It has very little to do with the optimizer. Nothing spectacular happens here. Your program doesn't skip any lines, and it does exactly what you ask it to do.

The problem is that, when you're allocating memory, you're interfacing with both the allocator and the operating system's paging system. Most likely, your operating system did not make all of those pages resident in memory, but instead marked the pages as allocated by your program, and will only make this memory actually exist when you use it. This is how most operating systems work.

To fix the problem, you will need to interface with the virtual memory allocator of your system to make the pages resident. On Linux, there are also huge pages, which may help you. On Windows, there's the VirtualAlloc API, but I haven't dug deep into that platform.
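
As a rough illustration of the Linux side of that (not a drop-in replacement for the code in the question), one can map the buffer with `mmap` and ask the kernel to pre-fault it with `MAP_POPULATE`; `MAP_HUGETLB` is optional and only works if huge pages have been reserved beforehand (e.g. via `/proc/sys/vm/nr_hugepages`). A minimal sketch, assuming Linux and glibc:

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    const std::size_t count = 0x80000000ull;            // same element count as in the question
    const std::size_t bytes = count * sizeof(uint32_t); // 8 GiB

    // MAP_POPULATE pre-faults the whole mapping, so the pages are resident
    // before the acquisition loop starts. Adding MAP_HUGETLB (after reserving
    // huge pages) would further reduce the number of page faults and TLB misses.
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    uint32_t* buffer = static_cast<uint32_t*>(p);
    // ... fill buffer from the oscilloscope ...
    (void)buffer;

    munmap(p, bytes);
}
```

This moves the cost of establishing the page mappings to a single, predictable point before the time-critical loop instead of spreading it over the iterations.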

Guillaume Racicot
  • Thanks. I also found the following one that has examples using C++: https://rigtorp.se/hugepages/ – ferdymercury Aug 31 '22 at 16:52
  • At least for Windows, you'll probably want to follow the `VirtualAlloc` with a [`VirtualLock`](https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtuallock) to ensure the pages stay resident. – Bob Sep 01 '22 at 01:40

You seem to be misinterpreting the situation. Virtual memory within a user-space process (heap space in this case) does get allocated “immediately” (possibly after a few system calls that negotiate a larger heap).

However, each page-aligned page-sized chunk of virtual memory that you haven’t touched yet will initially lack a physical page backing. Virtual pages are mapped to physical pages lazily, (only) when the need arises.

That said, the “allocation” you are observing (as part of the first access to the big heap space) is happening a few layers of abstraction below what GCC can directly influence and is handled by your operating system’s paging mechanism.

Side note: Another consequence would be, for example, that allocating a 1 TB chunk of virtual memory on a machine with, say, 128 GB of RAM will appear to work perfectly fine, as long as you never access most of that huge (lazily) allocated space. (There are configuration options that can limit such memory overcommitment if need be.)

When you touch your newly allocated virtual memory pages for the first time, each of them causes a page fault and your CPU ends up in a handler in the kernel because of that. The kernel evaluates the situation and establishes that the access was in fact legit. So it “materializes” the virtual memory page, i.e. picks a physical page to back the virtual page and updates both its bookkeeping data structures and (equally importantly) the hardware page mapping mechanism(s) (e.g. page tables or TLB, depending on architecture). Then the kernel switches back to your userspace process, which will have no clue that all of this just happened. Repeat for each page.

Presumably, the description above is hugely oversimplified. (For example, there can be multiple page sizes to strike a balance between mapping maintenance efficiency and granularity / fragmentation etc.)
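
As a small aside, on Linux one can actually watch this happen by reading the minor page-fault counter from `getrusage()` before and after the first touch of the buffer; roughly one fault is taken per previously untouched page (transparent huge pages can change the exact count). This is only an illustrative sketch with a deliberately smaller buffer, not something the explanation depends on:

```cpp
#include <sys/resource.h>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Number of minor (soft) page faults taken by this process so far.
static long minor_faults() {
    rusage ru{};
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main() {
    const std::size_t count = std::size_t{1} << 26;  // 256 MiB worth of uint32_t
    uint32_t* buffer = new uint32_t[count];          // virtual space only, no physical pages yet

    const long before = minor_faults();
    for (std::size_t i = 0; i < count; i += 1024)    // touch one element per 4 KiB page
        buffer[i] = 1;
    const long after = minor_faults();

    std::printf("minor page faults during first touch: %ld\n", after - before);
    delete[] buffer;
}
```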

A simple and ugly way to ensure that the memory buffer gets its hardware backing would be to find the smallest possible page size on your architecture (4 KiB on x86_64, for example, so 1024 of those integers, in most cases) and then touch each (possible) page of that memory beforehand, as in: `for (size_t i = 0; i < 0x80000000; i += 1024) buffer[i] = 1;`.

There are (of course) more reasonable solutions than that↑; this is just an example to illustrate what’s happening and why.
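
One such more reasonable option (also suggested in the comments) is to let the kernel fault the pages in and pin them with `mlock()` instead of touching them by hand. The sketch below assumes Linux/glibc, and `RLIMIT_MEMLOCK` typically has to be raised before locking a buffer this large:

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    const std::size_t count = 0x80000000ull;
    const std::size_t bytes = count * sizeof(uint32_t);
    uint32_t* buffer = new uint32_t[count];

    // mlock() faults every page of the range in and keeps it resident
    // (no swapping). Linux rounds the start address down to a page boundary,
    // so passing the new[] pointer directly is fine there.
    if (mlock(buffer, bytes) != 0)
        perror("mlock");                     // e.g. RLIMIT_MEMLOCK too low

    // ... time-critical acquisition loop ...

    munlock(buffer, bytes);
    delete[] buffer;
}
```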

Andrej Podzimek
  • Thanks for the nice explanation. I tried the for loop in steps of 1024, as I explained in one of the comments above, and it works great; it just does not improve the allocation time much compared to setting all values to zero in the constructor. I will look into using huge pages to see if that time can be reduced. – ferdymercury Aug 31 '22 at 13:24
  • You need to write a non-zero value, otherwise the memory is just mapped to a "zero" page. – Dipstick Aug 31 '22 at 13:25
  • Great point! Edited to `1` then. – Andrej Podzimek Aug 31 '22 at 13:31
  • Not really: I see no difference between writing 1 or writing 0. In both cases, the memory is allocated before the 'for loop'. Also, it takes in both cases the same time to allocate. – ferdymercury Aug 31 '22 at 15:07
  • That may well be the case. I believe the point was that `0` *might* get optimized away by the kernel by swapping the page for a shared read-only zero page (perhaps later on), not that it *must* happen. – Andrej Podzimek Aug 31 '22 at 16:40
  • Isn't `mlock` what the OP wants? Touching the pages doesn't make them stay in memory. – David Schwartz Aug 31 '22 at 23:29
  • Unless the OP uses swap, anonymous (i.e. not `mmap()`ed) pages do stay in memory. Swap could indeed throw a wrench into the plan and `mlock()` is definitely a better option. (Personally I haven’t used swap for at least a decade; swap eats SSDs for lunch when unsupervised and suspend-to-disk (if needed) is more flexible with files.) – Andrej Podzimek Sep 06 '22 at 12:28