I'm looking for SSE/AVX advice to optimize a routine that premultiplies each RGB channel by its alpha channel: RGB * alpha / 255 (the original alpha channel is kept unchanged).
for (int i = 0, max = width * height * 4; i < max; i += 4) {
    data[i]     = static_cast<uint16_t>(data[i]     * data[i+3]) / 255;
    data[i + 1] = static_cast<uint16_t>(data[i + 1] * data[i+3]) / 255;
    data[i + 2] = static_cast<uint16_t>(data[i + 2] * data[i+3]) / 255;
}
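For reference, here is a minimal scalar sketch of one common way to avoid the integer division in the loop above. It is not my benchmarked implementation: the names div255 and premultiply_scalar are made up for illustration, and the shift-based formula is only exact for products of two 8-bit values (0 to 255 * 255), which is all this routine needs.

```cpp
#include <cstddef>
#include <cstdint>

// Exact x / 255 for 0 <= x <= 255 * 255, using only adds and shifts.
// (Hypothetical helper; not valid for the full 16-bit range.)
inline std::uint8_t div255(std::uint32_t x) {
    return static_cast<std::uint8_t>((x + 1 + (x >> 8)) >> 8);
}

// Premultiply an RGBA8 buffer in place; the alpha byte is left untouched.
void premultiply_scalar(std::uint8_t* data, std::size_t pixel_count) {
    for (std::size_t i = 0; i < pixel_count * 4; i += 4) {
        const std::uint32_t a = data[i + 3];
        data[i]     = div255(data[i]     * a);
        data[i + 1] = div255(data[i + 1] * a);
        data[i + 2] = div255(data[i + 2] * a);
    }
}

// Exhaustive check against plain integer division for every byte pair.
bool div255_matches_division() {
    for (unsigned c = 0; c < 256; ++c)
        for (unsigned a = 0; a < 256; ++a)
            if (div255(c * a) != (c * a) / 255) return false;
    return true;
}
```

Because there are only 65536 possible (channel, alpha) pairs, the exhaustive check is cheap and is a convenient way to validate any faster variant against the plain division.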
You will find below my current implementation, but I think it could be much faster and that I'm wasting precious CPU cycles. I tested it with quick-bench.com and it shows encouraging results, but what should I change to make it as fast as possible?
Thanks
-------- UPDATED 09/06/2019 --------
Based on the comments from @chtz and @Peter Cordes, I put together a repository to assess the different solutions. Here are the results. Do you think it can be improved further?
Run on (8 X 3100 MHz CPU s)
CPU Caches:
L1 Data 32K (x4)
L1 Instruction 32K (x4)
L2 Unified 262K (x4)
L3 Unified 8388K (x1)
Load Average: 1.24, 1.60, 1.68
-----------------------------------------------------------------------------
Benchmark                  Time           CPU   Iterations UserCounters...
-----------------------------------------------------------------------------
v1_plain_mean        1189884 ns    1189573 ns         1000 itr=840.865/s
v1_plain_median      1184059 ns    1183786 ns         1000 itr=844.747/s
v1_plain_stddev        20575 ns      20166 ns         1000 itr=13.4227/s
v1_simd_x86_mean      297866 ns     297784 ns         1000 itr=3.3616k/s
v1_simd_x86_median    294995 ns     294927 ns         1000 itr=3.39067k/s
v1_simd_x86_stddev      9863 ns       9794 ns         1000 itr=105.51/s
Thanks Dot and Beached (discord #include)
v2_plain_mean         323541 ns     323451 ns         1000 itr=3.09678k/s
v2_plain_median       318932 ns     318855 ns         1000 itr=3.13623k/s
v2_plain_stddev        13598 ns      13542 ns         1000 itr=122.588/s
Thanks Peter Cordes (stackoverflow)
v3_simd_x86_mean      264323 ns     264247 ns         1000 itr=3.79233k/s
v3_simd_x86_median    260641 ns     260560 ns         1000 itr=3.83788k/s
v3_simd_x86_stddev     12466 ns      12422 ns         1000 itr=170.36/s
Thanks chtz (stackoverflow)
v4_simd_x86_mean      266174 ns     266109 ns         1000 itr=3.76502k/s
v4_simd_x86_median    262940 ns     262916 ns         1000 itr=3.8035k/s
v4_simd_x86_stddev     11993 ns      11962 ns         1000 itr=159.906/s
-------- UPDATED 10/06/2019 --------
I added an AVX2 version and used chtz's tip: by using 255 as the alpha value in color_odd, I was able to remove _mm_blendv_epi8 and improve the benchmark.
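To illustrate why a multiplier of 255 in the alpha lane makes the blend unnecessary, here is a simplified SSE2 sketch. It is not the v3 or v5 code from the repository: it does not use the even/odd split (color_odd), the function name premultiply_alpha255 is made up, and it assumes an exact divide-by-255 step so that alpha * 255 / 255 returns the original alpha. A scalar path handles leftover pixels and builds without SSE2.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Premultiply RGBA8 pixels in place (hypothetical helper for illustration).
void premultiply_alpha255(std::uint8_t* data, std::size_t pixel_count) {
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    const __m128i one = _mm_set1_epi16(1);
    // 0x00FF in the alpha lanes (16-bit lanes 3 and 7), 0 elsewhere.
    const __m128i alpha255 = _mm_set_epi16(0x00FF, 0, 0, 0, 0x00FF, 0, 0, 0);
    for (; i + 4 <= pixel_count; i += 4) {
        __m128i* p = reinterpret_cast<__m128i*>(data + i * 4);
        const __m128i px = _mm_loadu_si128(p);
        // Widen the 16 bytes to two vectors of eight 16-bit lanes (2 pixels each).
        __m128i halves[2] = {_mm_unpacklo_epi8(px, zero),
                             _mm_unpackhi_epi8(px, zero)};
        for (__m128i& h : halves) {
            // Broadcast each pixel's alpha to all four of its lanes.
            __m128i a = _mm_shufflelo_epi16(h, _MM_SHUFFLE(3, 3, 3, 3));
            a = _mm_shufflehi_epi16(a, _MM_SHUFFLE(3, 3, 3, 3));
            // Alpha lane multiplier becomes 255 (a | 0xFF), so no blend is needed.
            a = _mm_or_si128(a, alpha255);
            __m128i x = _mm_mullo_epi16(h, a);  // products fit in 16 bits
            // Exact x / 255 for x <= 255 * 255: (x + 1 + (x >> 8)) >> 8.
            x = _mm_add_epi16(_mm_add_epi16(x, one), _mm_srli_epi16(x, 8));
            h = _mm_srli_epi16(x, 8);
        }
        _mm_storeu_si128(p, _mm_packus_epi16(halves[0], halves[1]));
    }
#endif
    // Scalar path for the remaining pixels (or all pixels without SSE2).
    for (; i < pixel_count; ++i) {
        std::uint8_t* px = data + i * 4;
        for (int c = 0; c < 3; ++c) {
            const std::uint32_t x = static_cast<std::uint32_t>(px[c]) * px[3];
            px[c] = static_cast<std::uint8_t>((x + 1 + (x >> 8)) >> 8);
        }
    }
}
```

The alpha lane goes through the same multiply and divide as the color lanes and comes out unchanged, which is what removes the need to blend the original alpha back in afterwards.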
Thanks Peter and chtz
v3_simd_x86_mean      246171 ns     246107 ns          100 itr=4.06517k/s
v3_simd_x86_median    245191 ns     245167 ns          100 itr=4.07885k/s
v3_simd_x86_stddev      5423 ns       5406 ns          100 itr=87.13/s
// AVX2
v5_simd_x86_mean      158456 ns     158409 ns          100 itr=6.31411k/s
v5_simd_x86_median    158248 ns     158165 ns          100 itr=6.3225k/s
v5_simd_x86_stddev      2340 ns       2329 ns          100 itr=92.1406/s