I wrote my own `malloc`, `new`, and `realloc` for my C++ project. Some of the allocations are >= 4K. When I call my `malloc`, is there a way to zero out a 4K+ page without first reading its data into cache? I vaguely remember reading about something like this in either the Intel or AMD x86-64 documentation, but I can't remember what it's called.

Does gcc (or clang) have an intrinsic I can use? If not, what assembly instructions should I look up? I have 3 common use cases after a malloc: zeroing the memory, memcpy-ing a buffer, and mixing both (64 or 512 bytes of memcpy, then the rest as zeros). I'm not sure what the minimum architecture I'll support will be, but it's no less than Haswell. Likely it'll be Intel Skylake/AMD Zen and up.

-Edit- I rolled back the C++ tag to C because intrinsics are generally in C.

Cal
  • The `new`/`realloc` pair makes less sense than the struck out `malloc`/`realloc` pair. – xiver77 Jul 02 '22 at 22:38
  • @xiver77 I guess, but I use my function by calling `new`, which is overloaded to call my memory allocator. – Cal Jul 02 '22 at 22:39
  • How exactly is the allocation done at the system level? Modern OSes usually give you zeroed pages from the beginning. – xiver77 Jul 02 '22 at 22:42
  • Do you want to let the kernel zero fresh pages for you via page faults when you touch them? Or do you want to use MOVNT stores (including AMD's `clzero`) to actually write to memory? The latter would allow zeroing (re)allocations from the free-list, avoiding expensive system calls and page-faults, but if you're probably getting fresh memory anyway, use `calloc`. But NT stores aren't good if you might re-read that memory soon, in that case you'd *want* it cached. – Peter Cordes Jul 02 '22 at 22:43
  • @PeterCordes that's a hard question. I haven't profiled allocations, so I don't know if I'm doing any malloc+free in a loop, but if I was, wouldn't the page faults be very bad? NT stores that don't enter cache sound bad too. I mostly want to say "give me L1 cache that's flushed to L2/L3/RAM", but I don't want to read from RAM to do it. `calloc` doesn't make sense here because libc doesn't know about my pointers, since I allocate them in a single 1GB chunk (for new/malloc; realloc starts at a 4K mmap). – Cal Jul 02 '22 at 23:13
  • If your memory consumption is predictable in some way, you can just allocate a huge chunk (say 10GB) when the program starts, and physically zero it in whatever way. If you do all your allocation within this area during program run, the possible overheads from the system's memory management and zeroing are only at the beginning. – xiver77 Jul 02 '22 at 23:36
  • @PeterCordes I just checked glibc's calloc, and it seems to just toss the system's COW page if possible. – xiver77 Jul 03 '22 at 00:01
  • @xiver77: You still pay for copy-on-write page faults if you just used `mmap` without writing the pages. But if you did write them, you're wasting 10G of physical memory! And you'd still have to pay for TLB misses if you keep using new pages instead of re-zeroing dirty memory on the free list. – Peter Cordes Jul 03 '22 at 00:33
  • @Cal: yes, page faults are quite slow. `rep stosb` zeroing might be a good way of getting no-RFO stores for zeroing a page or so but still leaving L1d cache hot. (But potentially dirty; no good option to hint that it should write-back without evicting from L1d. `cldemote` evicts, and `clwb` is not a hint so it's more expensive, and before IceLake is handled as `clflushopt`.) See [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231) re: `rep movsb` which is similar. – Peter Cordes Jul 03 '22 at 00:36
  • @PeterCordes You advised to "use `calloc`", so I thought `calloc` does some more than an `mmap`, but it seemed basically the same when a new `mmap` was called. Wouldn't it be more natural to implement `malloc` with `mmap` rather than `calloc`? I'll go back to your last sentence when I write a custom allocator. – xiver77 Jul 03 '22 at 00:41
  • If you sometimes need to memcpy-then-zero, perhaps best to just `malloc` and `memcpy` the part you want, then `memset` the rest to zero right then. Rather than trying to zero the whole thing at some earlier point. (Or use `calloc` if you're only writing to a small part of the total allocation). Otherwise only if you want to allocate but not write at all would it make sense to use `calloc` – Peter Cordes Jul 03 '22 at 00:41
  • @xiver77: I think I missed / forgot something about this question; it's rolling its own `malloc`? Yeah, you'd probably want to use `mmap` yourself for new pages or groups of pages, if your free-list is empty (otherwise a bit of memory traffic to rezero a page is probably cheaper than a syscall + page fault + page-table manipulation; besides Linux will `rep stosb` to zero a page for you in the page fault handler). `calloc` is useful when you're not writing your own allocator, so you can get guaranteed-zero memory that avoids dirtying it if possible. – Peter Cordes Jul 03 '22 at 00:46
  • @PeterCordes yeah, rolling my own malloc. I'm sure this isn't a problem, I just like it when I can have things go fast. I haven't measured or profiled, but my thought was: IF I did something like parse a 64MB JSON file, I'd have it all on a bump allocator so I can delete everything at once, not a problem. However, the data I copied out would be unpredictable and fragmented, and I may recycle > 64MB of data, which certainly won't fit in my cache. So I was hoping I could zero the data in L1 without waiting for L1->L2->L3->RAM only to ignore it all anyway. – Cal Jul 03 '22 at 01:45
  • It sounds like I have no solutions, which is somewhat surprising, but I guess there could be a hardware reason why implementing what I want is difficult. But I'm no designer. I wish I could write RTL or get my own instruction set on an FPGA. – Cal Jul 03 '22 at 01:47
  • Yes, `rep stosb` can zero memory with no-RFO stores, avoiding the "read for ownership" part of normal stores. (IDK what the size threshold is for that happening). So can NT stores, but they leave L1d cold. – Peter Cordes Jul 03 '22 at 01:48
  • @PeterCordes I was writing my previous comment and hadn't read about REP MOVSB and such yet. What's RFO? I can't tell from this 30-second google. Do NT stores leave L2 or L3 hot? Because that's a lot better than RAM. – Cal Jul 03 '22 at 01:50
  • RFO is "read for ownership", part of MESI; read [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231) - that's why I linked it. NT stores are guaranteed to evict from all levels of cache. – Peter Cordes Jul 03 '22 at 01:56
  • @PeterCordes ty. I just finished reading the opening post when you made your comment. I'm running tinymembench from that link right now for fun – Cal Jul 03 '22 at 01:58
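
For reference, the NT-store option discussed in the comments above could look roughly like this. It is a minimal sketch, assuming x86-64 with SSE2 and a GCC/Clang-style compiler; `zero_nt` is a hypothetical helper name, and the alignment and size requirements are assumptions made to keep it short. As the comments point out, these stores bypass the cache, so the zeroed memory is left cold.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_setzero_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: zero `bytes` bytes at `dst` with non-temporal stores.
   Assumption for this sketch: dst is 16-byte aligned and bytes is a
   multiple of 64 (one cache line). */
static void zero_nt(void *dst, size_t bytes)
{
    assert(((uintptr_t)dst & 15) == 0 && bytes % 64 == 0);

    const __m128i zero = _mm_setzero_si128();
    char *p = dst;
    for (size_t i = 0; i < bytes; i += 64) {
        /* Four 16-byte NT stores cover one 64-byte cache line. */
        _mm_stream_si128((__m128i *)(p + i +  0), zero);
        _mm_stream_si128((__m128i *)(p + i + 16), zero);
        _mm_stream_si128((__m128i *)(p + i + 32), zero);
        _mm_stream_si128((__m128i *)(p + i + 48), zero);
    }
    /* NT stores are weakly ordered; fence before later stores depend on them. */
    _mm_sfence();
}
```

A page-aligned 4K block from the allocator satisfies both assumptions, e.g. `zero_nt(page, 4096)`.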

2 Answers

Under Unix systems you can mmap /dev/zero to get zero-filled pages. That would give you zeroed pages for sure. Depending on the kernel, MAP_ANONYMOUS might also give you zero-filled pages. Both ways should not poison the caches.

You can also use MAP_POPULATE (Linux) to allocate physical pages from the start instead of faulting them in on first access. Hopefully this wouldn't poison the caches either but I never verified that in the Linux source.
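
A minimal sketch of the anonymous-mapping route, assuming Linux with glibc. `map_zeroed` is a hypothetical helper name, not part of any library, and the `populate` parameter is only there to illustrate the optional MAP_POPULATE flag.

```c
#define _DEFAULT_SOURCE  /* assumption: glibc, exposes MAP_ANONYMOUS */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: ask the kernel for `bytes` of zero-filled memory.
   If `populate` is nonzero and MAP_POPULATE exists (Linux), the pages are
   faulted in up front instead of on first touch. Returns NULL on failure. */
static void *map_zeroed(size_t bytes, int populate)
{
    assert(bytes != 0);  /* mmap rejects zero-length mappings */

    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_POPULATE
    if (populate)
        flags |= MAP_POPULATE;
#else
    (void)populate;
#endif
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, flags, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Release the region with `munmap(p, bytes)` when it is no longer needed.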

But I have to wonder: Why would you zero out the pages on malloc/realloc/new? Only calloc zeroes out pages, and for everything else the compiler or source code will zero out the memory. Unless you change the compiler to know that you've already zeroed out the pages, there won't be any benefit.

Note: For many types in C++ the memory will not be zeroed out at all but initialized using the constructors.

Goswin von Brederlow
  • PS: this only holds if the kernel has pre-zeroed pages available. Otherwise it will zero them when it maps them and then the cache is poisoned. – Goswin von Brederlow Jul 02 '22 at 23:15
  • "Why would you zero out the pages" <-- I have 3 specific parts of my code that need to go fast and that can use a faster allocating algorithm than what malloc and jemalloc provide. It's specific to my use case. Some of it transforms data into a tree, then to a flat file, then frees, then allocates new data for a tree, always in specific sizes. It expects the memory to be zero, so I need to write zeroes when I reuse memory. – Cal Jul 02 '22 at 23:18
  • @Cal That should be calling `calloc` then, and that's a case you can optimize. But it really only helps on the first allocation, I would bet. If you `munmap` and then `mmap` to reuse memory, that is 2 syscalls and the kernel will have to zero out the pages for you. I can see 2 cases where that might be faster: 1) your code isn't multithreaded and the kernel can zero pages in the background using an idle core. 2) the kernel uses a DMA engine to zero pages. Otherwise just zeroing them yourself will be faster. Also look into compiler builtins or inline asm for non-caching stores to memory. – Goswin von Brederlow Jul 02 '22 at 23:28
  • @Cal: If you already have a page allocated and dirty, it's faster to re-zero it yourself, regardless of cache pollution. (You *could* `mmap(MAP_FIXED|MAP_ANONYMOUS)` to replace the region with fresh lazily-allocated pages that will copy-on-write map to a system-wide shared page of zeros if you read before writing, with only one system call. Like Linux `madvise(MADV_FREE)`, but guaranteed to take effect now, not just on memory pressure.) When you need fresh pages, though, you should let the allocator optimize the potential zeroing for you by calling `calloc` to avoid writing if it used mmap. – Peter Cordes Jul 03 '22 at 00:28
  • Or if you're writing your own allocator, then you should be using `mmap(MAP_ANONYMOUS)`. mmap of `/dev/zero` is ancient history. Or at least I thought it was, seems it's not part of POSIX! Only BSD, SVID, etc. - most real-world systems do support it. If supported, it guarantees(?) `MAP_ANONYMOUS` gives zeroed pages. (On Linux there's a `MAP_UNINITIALIZED` you can use to override that, but only some embedded systems allow that at all, because it leaks data to user-space.) – Peter Cordes Jul 03 '22 at 00:51
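
The `mmap(MAP_FIXED|MAP_ANONYMOUS)` trick from the comments above, which swaps an already-dirty region for fresh lazily-allocated zero pages in one system call, might be sketched as follows. This assumes Linux with glibc; `remap_zeroed` is a hypothetical helper name, and the page-alignment requirement is an assumption enforced with `assert`. Per the comments, this costs a syscall plus later page faults, so re-zeroing in user space is usually cheaper for memory you are about to reuse.

```c
#define _DEFAULT_SOURCE  /* assumption: glibc, exposes MAP_ANONYMOUS */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: replace [addr, addr + bytes) with fresh anonymous
   pages that read as zero. The old contents are discarded.
   Assumption: addr and bytes are multiples of the page size, and the range
   belongs to this allocator (MAP_FIXED silently replaces whatever is there).
   Returns 0 on success, -1 on failure. */
static int remap_zeroed(void *addr, size_t bytes)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    assert((uintptr_t)addr % page == 0 && bytes % page == 0 && bytes != 0);

    void *p = mmap(addr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return p == MAP_FAILED ? -1 : 0;
}
```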

I think rep stosb meets your needs. Even though it does 1-byte writes, it uses write combining internally, so it will fill a full cache line before issuing a write. Then since an entire cache line is being written, it doesn't need to read the dead contents of memory before writing the line to L1.
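
As a rough illustration, `rep stosb` can be issued from C with GCC/Clang inline asm. This is a minimal sketch assuming x86-64; `zero_rep_stosb` is a hypothetical helper name. Whether the CPU actually takes a no-RFO path depends on the microarchitecture and the size, so measure before relying on it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (x86-64, GCC/Clang inline asm): zero `bytes` bytes
   starting at `dst` using rep stosb. */
static void zero_rep_stosb(void *dst, size_t bytes)
{
    assert(dst != NULL || bytes == 0);

    /* rep stosb stores AL to [RDI], RCX times, advancing RDI each time.
       The x86-64 ABI keeps the direction flag clear, so it counts upward.
       "+D" and "+c" tell the compiler that RDI and RCX are modified. */
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(bytes)
                     : "a"(0)
                     : "memory");
}
```

For a recycled 4K block this would be called as `zero_rep_stosb(page, 4096)`.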

prl
  • You mean `rep stosb` for pure zeroing; you don't want to copy zeros from anywhere. You can use `rep movsb` for no-RFO (read for ownership) copying for the cases where some copying is desired, before zeroing the rest of the page with `rep stosb`. But yes, [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231) has some info about the no-RFO stores that optimized microcode can do, at least for large enough copies. – Peter Cordes Jul 03 '22 at 07:47
  • Duh, yes, of course, thanks. – prl Jul 03 '22 at 07:58
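
The mixed case from the comments above (copy a small prefix, then zero the rest) could be sketched with `rep movsb` followed by `rep stosb`. This assumes x86-64 and GCC/Clang inline asm; `copy_then_zero` is a hypothetical helper name, and the source and destination are assumed not to overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (x86-64, GCC/Clang inline asm): copy `src_bytes`
   bytes from `src` to the start of `dst`, then zero the remaining
   `dst_bytes - src_bytes` bytes. Assumption: the buffers do not overlap. */
static void copy_then_zero(void *dst, size_t dst_bytes,
                           const void *src, size_t src_bytes)
{
    assert(src_bytes <= dst_bytes);
    size_t tail = dst_bytes - src_bytes;

    /* rep movsb copies RCX bytes from [RSI] to [RDI]; afterwards `dst`
       (RDI) points just past the copied prefix. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(src_bytes)
                     :
                     : "memory");

    /* rep stosb stores AL (0) into the `tail` bytes that follow. */
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(tail)
                     : "a"(0)
                     : "memory");
}
```

For the 64-byte-header case from the question, this would be `copy_then_zero(page, 4096, header, 64)`.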