I'm learning to use SIMD instructions in C and want to compare SIMD code against equivalent non-SIMD code. Does anyone have a testing template that accurately measures the speedup of SIMD code over standard code?
Specifically, I've noticed the following approximate timings, depending on the configuration:
SIMD first, single run:
SIMD: 0.15 s
standard: 0.35 s
SIMD first, standard second, repeated 10x:
SIMD: 0.15 s - first run, 0.05 s on subsequent runs
standard: 0.35 s - first run, 0.34 s on subsequent runs
standard first, SIMD second, repeated 10x:
standard: 0.45 s - first run, 0.35 s on subsequent runs
SIMD: 0.05 s - first run, 0.05 s on subsequent runs
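For concreteness, here is a minimal sketch of the structure I'm timing (not the exact gist code: `simd_sum` and `scalar_sum` are stand-in summing kernels I wrote for this post, and the SIMD one assumes SSE2 since the machine is x86):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

#define N    100000000ULL  /* 1e8 uint16_t values */
#define REPS 10

/* Scalar reference kernel. */
static uint64_t scalar_sum(const uint16_t *data, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += data[i];
    return s;
}

/* SSE2 kernel: zero-extend u16 lanes to u32 before accumulating, so the
   per-lane partial sums cannot overflow for the small values used below. */
static uint64_t simd_sum(const uint16_t *data, size_t n) {
    __m128i acc  = _mm_setzero_si128();
    __m128i zero = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
        acc = _mm_add_epi32(acc, _mm_unpacklo_epi16(v, zero));
        acc = _mm_add_epi32(acc, _mm_unpackhi_epi16(v, zero));
    }
    uint32_t lane[4];
    _mm_storeu_si128((__m128i *)lane, acc);
    uint64_t s = (uint64_t)lane[0] + lane[1] + lane[2] + lane[3];
    for (; i < n; i++)  /* scalar tail */
        s += data[i];
    return s;
}

static double now_s(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    /* Allocation and initialization outside the repetition loop. */
    uint16_t *data = malloc(N * sizeof *data);
    if (!data) return 1;
    for (size_t i = 0; i < N; i++)
        data[i] = (uint16_t)(i & 0xF);

    for (int r = 0; r < REPS; r++) {
        double t0 = now_s();
        uint64_t a = simd_sum(data, N);    /* SIMD first ...      */
        double t1 = now_s();
        uint64_t b = scalar_sum(data, N);  /* ... standard second */
        double t2 = now_s();
        printf("run %d: SIMD %.3f s  standard %.3f s  (sums %llu %llu)\n",
               r, t1 - t0, t2 - t1,
               (unsigned long long)a, (unsigned long long)b);
    }
    free(data);
    return 0;
}
```

Printing the sums keeps the compiler from optimizing either kernel away.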
The code example runs through a dataset of 1e8 values of type uint16_t. Data allocation and initialization happen outside the repetition loop. If I instead allocate the data inside the repetition loop, all the iterations show the same timing. And if I allocate fresh data before each of the SIMD and standard sections (see the sketch below), rather than just once before whichever section comes first, I get the larger times for both:
standard: 0.45 s
SIMD: 0.15 s
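Here is the variant of the repetition loop where fresh data is allocated and initialized immediately before each timed section (reusing `N`, `REPS`, `now_s`, and the kernels from the sketch above; again just an illustration of the setup, not the gist code):

```c
for (int r = 0; r < REPS; r++) {
    /* Fresh allocation for the standard section ... */
    uint16_t *d1 = malloc(N * sizeof *d1);
    for (size_t i = 0; i < N; i++) d1[i] = (uint16_t)(i & 0xF);
    double t0 = now_s();
    uint64_t b = scalar_sum(d1, N);
    double t1 = now_s();
    free(d1);

    /* ... and another fresh allocation for the SIMD section. */
    uint16_t *d2 = malloc(N * sizeof *d2);
    for (size_t i = 0; i < N; i++) d2[i] = (uint16_t)(i & 0xF);
    double t2 = now_s();
    uint64_t a = simd_sum(d2, N);
    double t3 = now_s();
    free(d2);

    printf("run %d: standard %.3f s  SIMD %.3f s  (sums %llu %llu)\n",
           r, t1 - t0, t3 - t2,
           (unsigned long long)b, (unsigned long long)a);
}
```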
So why does data allocation cause such large time differences? And what is the real speedup?
Link to code: https://gist.github.com/JimHokanson/55ce2e5cac75d7df6dc24dadf383e68f
I'm testing on an Early 2016 MacBook with an Intel Core m3 processor ...