I've implemented a vectorized version of the Black-Scholes formula using 256-bit SIMD, and an admittedly unscientific benchmark is telling me I'm getting about a 20x performance boost, which is pretty good.
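For context, the code follows the usual four-lanes-at-a-time pattern. The sketch below shows only the shape of it; black_scholes_pd and price_all are placeholder names with a trivial stand-in body, not my real kernel:

#include <immintrin.h>
#include <cstddef>

// Trivial stand-in so this sketch compiles; the real kernel computes the
// Black-Scholes price for all four lanes.
static __m256d black_scholes_pd(__m256d s) {
    const static __m256d half = {0.5, 0.5, 0.5, 0.5};
    return _mm256_mul_pd(s, half);
}

// The vectorization pattern: four options priced per loop iteration.
void price_all(const double* spot, double* out, std::size_t n) {
    for (std::size_t i = 0; i + 4 <= n; i += 4) {
        __m256d s = _mm256_loadu_pd(spot + i);  // load four spots at once
        _mm256_storeu_pd(out + i, black_scholes_pd(s));
    }
}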
One of my lines of code initializes a 256-bit vector with four double values of 0.5:
const static __m256d half = {0.5, 0.5, 0.5, 0.5};
Now here's the thing: if I replace the above with
const static __m256d half = _mm256_set1_pd(0.5);
then I get only about 1/4 of the performance boost! I can get the performance back if I remove the static keyword, but why?
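Boiled down, these are the three variants I'm comparing. The multiply in each body is just a stand-in so the snippet compiles on its own; only the declaration of half matches my actual code:

#include <immintrin.h>

// Variant A: static, braced initializer -- full ~20x speedup.
__m256d variant_a(__m256d x) {
    const static __m256d half = {0.5, 0.5, 0.5, 0.5};
    return _mm256_mul_pd(x, half);
}

// Variant B: static, _mm256_set1_pd -- only ~1/4 of the speedup.
__m256d variant_b(__m256d x) {
    const static __m256d half = _mm256_set1_pd(0.5);
    return _mm256_mul_pd(x, half);
}

// Variant C: non-static, _mm256_set1_pd -- full speedup again.
__m256d variant_c(__m256d x) {
    const __m256d half = _mm256_set1_pd(0.5);
    return _mm256_mul_pd(x, half);
}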
Link to full example in Compiler Explorer: https://godbolt.org/z/vBq_fz
I am compiling the example with MSVC (VS 2019), 64-bit, in Release mode. This strange difference does not show up in Debug mode.