
I need to initialize every node of a tree with something like:

  this->values = (float*)_aligned_malloc(mem * sizeof(float), 32);
  this->frequencies = (float*)_aligned_malloc(mem * sizeof(float), 32);

where `mem` is rather big (~100k-1M), `values` are all 0 and `frequencies` are all `1/numChildren` (an arbitrary float per node).

The fastest (although by a small amount) was `std::fill_n`:

    std::fill_n(this->values, mem, 0);
    std::fill_n(this->frequencies, mem, 1 / (float)numchildren);

I thought using AVX2 intrinsics would make it faster, something like:

    float v = 1 / (float)numchildren;
    __m256 nc = _mm256_set_ps(v, v, v, v, v, v, v, v);
    __m256 z = _mm256_setzero_ps();
    for (long i = 0; i < mem; i += 8)
    {
        _mm256_store_ps(this->frequencies + i, nc);
        _mm256_store_ps(this->values + i, z);
    }

This was actually a bit slower, and about as slow as the naive loop:

        for (auto i = 0; i < mem; i++)
        {
            this->values[i] = 0;
            this->frequencies[i] = 1 / (float)numchildren;
        }

I assume the intrinsics may actually copy their arguments on each call. Since all values are the same, I want to load them into one register just once and store it to different memory locations multiple times, and I don't think that's what's happening here.
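The "broadcast once, store many" intent can be sketched with `_mm256_set1_ps`, the conventional way to splat one value into a register. This is a minimal sketch, assuming `mem` is a multiple of 8 and both pointers are 32-byte aligned; the `target` attribute is only there so GCC/Clang can compile it without `-mavx2` (MSVC needs no flag):

```cpp
#include <immintrin.h>
#include <cassert>

// Sketch of the intended fill: broadcast each constant into a register
// once, then store it repeatedly. Assumes mem % 8 == 0 and 32-byte
// aligned pointers (required by _mm256_store_ps).
#if defined(__GNUC__)
__attribute__((target("avx2")))
#endif
void fill_node(float* values, float* frequencies, long mem, int numChildren)
{
    __m256 nc = _mm256_set1_ps(1.0f / (float)numChildren); // one broadcast
    __m256 z  = _mm256_setzero_ps();
    for (long i = 0; i < mem; i += 8)
    {
        _mm256_store_ps(values + i, z);       // 8 zeros per store
        _mm256_store_ps(frequencies + i, nc); // 8 copies of 1/numChildren
    }
}
```

Splitting this into two separate loops (one per output array) may also be worth benchmarking, since writing two output streams at once can be faster or slower depending on the CPU.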

john smith
  • If you don't mind later page faults for lazy allocation, allocating your `values` with `mmap` or `VirtualAlloc` would give you zeroed memory that's only actually allocated as it's touched. There unfortunately isn't a portable C++ `aligned_calloc`, although I think there might be one on MSVC if that's what you're using. – Peter Cordes May 27 '22 at 17:28
  • As for memset-like loops, see [Why is std::fill(0) slower than std::fill(1)?](https://stackoverflow.com/q/42558907) for some details of what it does under the hood. And [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231) re: NT stores vs. "regular" stores. I'm not surprised the naive loop auto-vectorized to the same as your manual loop storing `_mm256_set1_ps(v)` (which is a more convenient way to write that, BTW). You might consider trying two separate loops, instead of writing two output streams. It might be faster or slower depending on the CPU and compiler. – Peter Cordes May 27 '22 at 17:31
  • Why were you expecting `std::fill_n` to be slower? Just curious – Erik Eidt May 27 '22 at 17:33
  • Every element will be touched, so there's no benefit to delaying it. And I'm particularly interested in non-zero values, i.e. one node may have 1 million values of 0.25 whereas another node has 1 million values of 0.2, etc. – john smith May 27 '22 at 17:35
  • I didn't expect `std::fill_n` to copy 8 floats at once with one assembler instruction; that's how I expected to gain a speedup. – john smith May 27 '22 at 17:37
  • Define a `struct alignas(32) Float8 { float f[8]; };` and allocate an array of that. That should make the compiler know about the alignment and vectorise even the naive loop the best way. – Goswin von Brederlow May 27 '22 at 17:41
  • So it seems that trying to use AVX2 instructions won't give much of a performance boost, given the compiler's optimization? Does that mean I can safely modify them one by one in further iterations without trying to update 8 values simultaneously? That would save a lot of coding time, but performance is the top priority. – john smith May 27 '22 at 17:50
  • Manual AVX optimization can give a big performance boost in many cases, but this one is not one of those. The compiler writers have already done a lot of tests to know the most efficient way to implement `std::fill_n` or `memset`. Don't forget to turn `-mavx2` on; it could matter if the compiler decides to inline those functions. – xiver77 May 27 '22 at 18:54
  • The naive implementation works slightly faster even without AVX flags. Also, it takes ~1.5 seconds to allocate + initialize ~7.5 GB of RAM this way, which I assume should take fractions of a second?? – john smith May 27 '22 at 18:58
  • Look at the assembly output of your naive implementation. See the difference to yours, and you'll know why yours is a bit slower. If you write the exact same code either with intrinsics or straight assembly, you will get the same performance. Also use `@username` for a fast reply. – xiver77 May 27 '22 at 19:01
  • @johnsmith: 1.5 seconds includes a page fault on each page the first time it's written (or hopefully one per multiple pages, if the kernel notices the sequential access and allocates + wires into the page table multiple zeroed pages after the one that faulted, so the next soft page fault won't happen until your user-space store loop gets to the end of that range). That's where most of the time goes, including the kernel internally doing a `memset(page, 0, 4096)` for each page. At least that makes the page hot in cache right before you write it in user-space. – Peter Cordes May 27 '22 at 20:13
  • @PeterCordes I can traverse the entire tree and calculate the memory needed in negligible time (~10-20 ms), so shall I allocate all that memory at once and map it into virtual address space to avoid page faults, then allocate memory for each node from that pool on the 2nd pass of the tree? Also, I only need half of that memory to be 0s; the second half would be initialized with different float values. That's half a billion writes, given that I write 8 floats at once. – john smith May 27 '22 at 21:20
  • If you want to pre-fault the zeroed memory for later use instead of letting it get faulted in as it's touched, you still only need to touch 1 byte (or 1 float or 1 SIMD vector) per page, with an actual write. Perhaps write 1/2 page (2048 bytes) of the other array, then store a zero to this, then the other half. So you interleave your page faults with stores, which can hopefully drain from the store buffer to L1d / L2 / L3 / DRAM while the CPU is stalled switching to the kernel. Minimizing bubbles in your memory bandwidth, and letting HW prefetch do RFOs, too. Worth trying at least. – Peter Cordes May 27 '22 at 22:58
  • Or if VirtualAlloc or VirtualAllocEx has an option to pre-fault all the pages so they're backed by separate physical pages, then you don't need to touch it. – Peter Cordes May 27 '22 at 22:59
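The pre-faulting idea from the comments above can be sketched as follows. This is an illustration only: `prefault` is a hypothetical helper name, the 4096-byte page size is an assumption, and it presumes the memory came from an allocator that hands back zeroed pages (e.g. `mmap`/`VirtualAlloc`), so writing 0 leaves the contents unchanged:

```cpp
#include <cstddef>
#include <cassert>

// Touch one byte per page so all soft page faults happen up front,
// instead of being interleaved with the real initialization loop.
// Writing 0 to already-zeroed pages does not change their contents.
void prefault(void* p, std::size_t bytes)
{
    const std::size_t page = 4096; // assumed page size
    volatile char* c = static_cast<volatile char*>(p);
    for (std::size_t off = 0; off < bytes; off += page)
        c[off] = 0; // one write per page forces the fault
}
```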

1 Answer


From the use of `_aligned_malloc`, I assume you're on Windows.

On Windows, you can allocate large amounts of memory with `VirtualAlloc`; it will be page-aligned (4096 bytes) and already zeroed by the OS, which is likely faster than manual zeroing.

Note that `VirtualAlloc` is always a kernel call, but a huge `_aligned_malloc` is very likely a kernel call anyway.
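The POSIX analogue mentioned in the comments is `mmap` with `MAP_ANONYMOUS`, which likewise returns page-aligned, kernel-zeroed memory. A minimal sketch (assumption: Linux/POSIX; `alloc_zeroed_floats` is a hypothetical helper name):

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cassert>

// Anonymous private mappings are page-aligned and zero-filled by the
// kernel, so a `values` array of 0.0f needs no explicit fill at all.
float* alloc_zeroed_floats(std::size_t n)
{
    void* p = mmap(nullptr, n * sizeof(float),
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS,
                   -1, 0);
    return p == MAP_FAILED ? nullptr : static_cast<float*>(p);
}
```

Free such a block with `munmap(p, n * sizeof(float))` rather than `free`.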

Alex Guteniev
  • The problem is not allocation but initialization with non-zero floats. Initializing with AVX2 intrinsics (which should initialize 8 float values simultaneously) didn't improve speed compared to naive non-AVX2 initialization, so I assume I'm doing something wrong (i.e. unnecessary copying, inefficient use of registers, etc.). – john smith May 27 '22 at 19:56
  • @johnsmith: You're missing the point of this answer. The bit-pattern for float `0.0f` is all bits zero, so zeroed memory straight from the kernel already is an array of `0.0f`. You can *just* allocate and (for now) avoid writing this memory at all. – Peter Cordes May 27 '22 at 20:07
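The point in the last comment can be checked directly: IEEE-754 encodes `0.0f` as all-zero bits, so byte-zeroed memory (from `calloc`, `memset`, or fresh kernel pages) is already a valid array of `0.0f`. A small sketch (the helper name is illustrative):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Zero-filled bytes reinterpret as an array of 0.0f, because the
// IEEE-754 single-precision encoding of +0.0 is all-zero bits.
bool zero_bytes_are_zero_floats(std::size_t n)
{
    float* a = static_cast<float*>(std::calloc(n, sizeof(float)));
    if (!a) return false;
    bool ok = true;
    for (std::size_t i = 0; i < n; ++i)
        ok = ok && (a[i] == 0.0f);
    std::free(a);
    return ok;
}
```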