I need to initialize every node of a tree with something like:
this->values = (float*) _aligned_malloc(mem * sizeof(float), 32);
this->frequencies = (float*) _aligned_malloc(mem * sizeof(float), 32);
where mem is rather large (~100k to 1M elements), values must be all zeros, and every element of frequencies must equal 1/numchildren (a different float for each node).
The fastest option (although only by a small margin) was std::fill_n:
std::fill_n(this->values, mem, 0);
std::fill_n(this->frequencies, mem, 1 / (float)numchildren);
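For anyone who wants to reproduce this, here is a minimal self-contained sketch of the std::fill_n variant. The Node struct and the alloc_floats / free_floats helpers are hypothetical names invented for the example, not the real tree code. The non-MSVC branch assumes std::aligned_alloc (C++17) as a stand-in for _aligned_malloc, and mem is assumed to be greater than zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#ifdef _MSC_VER
#include <malloc.h>   // _aligned_malloc / _aligned_free
#endif

// Hypothetical helper: allocate `count` floats on a 32-byte boundary.
inline float* alloc_floats(std::size_t count) {
#ifdef _MSC_VER
    return static_cast<float*>(_aligned_malloc(count * sizeof(float), 32));
#else
    // std::aligned_alloc expects the size to be a multiple of the alignment,
    // so round the byte count up to the next multiple of 32.
    const std::size_t bytes = (count * sizeof(float) + 31) / 32 * 32;
    return static_cast<float*>(std::aligned_alloc(32, bytes));
#endif
}

// Hypothetical helper: release a buffer obtained from alloc_floats.
inline void free_floats(float* p) {
#ifdef _MSC_VER
    _aligned_free(p);
#else
    std::free(p);
#endif
}

// Hypothetical stand-in for one tree node (the real node has more members).
struct Node {
    float* values = nullptr;
    float* frequencies = nullptr;
    std::size_t mem = 0;

    Node(std::size_t mem_, int numchildren) : mem(mem_) {
        assert(mem > 0 && numchildren > 0);
        values = alloc_floats(mem);
        frequencies = alloc_floats(mem);
        assert(values != nullptr && frequencies != nullptr);
        // 0.0f (not 0) keeps the fill value the same type as the elements.
        std::fill_n(values, mem, 0.0f);
        std::fill_n(frequencies, mem, 1.0f / static_cast<float>(numchildren));
    }

    ~Node() {
        free_floats(values);
        free_floats(frequencies);
    }

    // The node owns raw buffers, so copying is disabled in this sketch.
    Node(const Node&) = delete;
    Node& operator=(const Node&) = delete;
};
```

Usage would be something like `Node n(1000000, 4);`, after which every element of n.frequencies is 0.25f and every element of n.values is 0.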
I thought using AVX2 intrinsics would make it faster, something like:
float v = 1 / (float)numchildren;
__m256 nc = _mm256_set1_ps(v); // broadcast v to all 8 lanes
__m256 z = _mm256_setzero_ps();
for (long i = 0; i < mem; i += 8)
{
    _mm256_store_ps(this->frequencies + i, nc);
    _mm256_store_ps(this->values + i, z);
}
This was actually a bit slower, about as slow as the naive loop:
for (auto i = 0; i < mem; i++)
{
    this->values[i] = 0;
    this->frequencies[i] = 1 / (float)numchildren;
}
I assume the intrinsics may actually copy their arguments on each call. Since all the values are the same, I want to load them into one register just once and then store that register to many different memory locations, and I don't think that is what's happening here.
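To make the intent concrete, here is a minimal sketch of the "broadcast once, store many times" pattern. The function name fill_aligned is hypothetical. The sketch assumes dst is 32-byte aligned and that the compiler defines __AVX__ when AVX code generation is enabled (for example -mavx on GCC/Clang or /arch:AVX on MSVC); otherwise it falls back to std::fill_n. Unlike the loop above, it does not assume the element count is a multiple of 8. It only illustrates the pattern and makes no claim about being faster than std::fill_n.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper: set dst[0..count) to v.
// Assumption: dst is 32-byte aligned, as with _aligned_malloc(..., 32).
inline void fill_aligned(float* dst, std::size_t count, float v) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 32 == 0);
#if defined(__AVX__)
    // Broadcast v into all 8 lanes once, outside the loop.
    const __m256 vec = _mm256_set1_ps(v);
    // Largest multiple of 8 that fits in count.
    const std::size_t vec_end = count - count % 8;
    for (std::size_t i = 0; i < vec_end; i += 8) {
        _mm256_store_ps(dst + i, vec);   // aligned 256-bit store
    }
    // Scalar tail for the remaining 0..7 elements.
    std::fill(dst + vec_end, dst + count, v);
#else
    // No AVX enabled at compile time: plain library fill.
    std::fill_n(dst, count, v);
#endif
}
```

With this helper the two buffers would be filled by `fill_aligned(values, mem, 0.0f);` and `fill_aligned(frequencies, mem, 1.0f / numchildren);`.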