What is the best way to use SIMD / assembler to subtract 2 uint16s with absolute value (max difference) and add (+=) the result to a float?
Something like this C-ish example:
c0 += fabsf((float)a0 - (float)b0); // c0 is the float accumulator; a0 and b0 are pixel values
where a and b are unsigned 16-bit words and c is a float. The goal is only 1 word -> float conversion rather than 3.
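For illustration, here is a rough sketch of how that one-liner might map onto 128-bit SIMD. It is only one possible approach, and it assumes SSE2 (which every x86-64 CPU has); `absdiff_accumulate8` is a made-up name. The absolute difference is taken in the integer domain with two saturating subtractions, so only the difference is converted to float.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: acc[i] += |a[i] - b[i]| for 8 uint16 lanes at once.
   a and b must each point at 8 readable uint16 values, acc at 8 floats. */
static void absdiff_accumulate8(const uint16_t *a, const uint16_t *b, float *acc)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* Unsigned saturating subtract clamps negative results to 0, so in each
       lane one of (a-b), (b-a) is the absolute difference and the other is 0. */
    __m128i d = _mm_or_si128(_mm_subs_epu16(va, vb), _mm_subs_epu16(vb, va));

    /* Zero-extend the 16-bit differences to 32 bits (lanes 0-3 and 4-7). */
    __m128i zero = _mm_setzero_si128();
    __m128i lo = _mm_unpacklo_epi16(d, zero);
    __m128i hi = _mm_unpackhi_epi16(d, zero);

    /* A single int -> float conversion per lane, then accumulate. */
    _mm_storeu_ps(acc,     _mm_add_ps(_mm_loadu_ps(acc),     _mm_cvtepi32_ps(lo)));
    _mm_storeu_ps(acc + 4, _mm_add_ps(_mm_loadu_ps(acc + 4), _mm_cvtepi32_ps(hi)));
}
```

Eight 16-bit lanes hold two RGB pixels (6 words) plus two spare lanes, so the load reads 4 bytes past the second pixel and the buffer must allow for that. In a real loop the accumulators would presumably stay in registers rather than being loaded and stored on every call. AVX2 has 256-bit counterparts of the same operations (`_mm256_subs_epu16`, `_mm256_cvtepu16_epi32`, `_mm256_cvtepi32_ps`).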
The application is processing raw, 16-bit unsigned int image data on as many full RGB pixels as possible at once.
Possibly using AVX2/SSE4.2 on a Skylake Xeon E3-1275 v5?
"Are you sure you need float?" A uint16 can't safely accumulate more than 1 difference. I want to do a neighborhood contrast calculation, so I need to sum at least 8 differences. There are (2D+1)^2-1 neighbors in a neighborhood of depth D. I also want to be able to square the difference, which is where a uint32 can be too small. I think the floats look smoother too.
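To put numbers on that, a small sketch (the function names are invented for this example) that counts the neighbors for a given depth and computes the worst-case sum of squared differences in 64 bits, so it can be compared against the uint32 range:

```c
#include <stdint.h>

/* Number of neighbors around a home pixel for depth D: (2D+1)^2 - 1. */
static uint32_t neighbor_count(uint32_t depth)
{
    uint32_t side = 2u * depth + 1u;
    return side * side - 1u;
}

/* Worst-case sum of squared uint16 differences over one neighborhood,
   computed in 64 bits so it can be compared against UINT32_MAX. */
static uint64_t worst_case_sum_sq(uint32_t depth)
{
    const uint64_t max_diff = 65535u; /* largest possible |a - b| for uint16 */
    return (uint64_t)neighbor_count(depth) * max_diff * max_diff;
}
```

A single squared difference of 65535 still fits in a uint32, but for D = 1 the worst-case sum over 8 neighbors is 34358689800, well past UINT32_MAX (4294967295).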
Here is a bit more background on what is already working and how I want to improve it.
To clarify, my current C code calculates the per-channel differences between a fixed home pixel and 8 or more neighbors. It has a 5-deep nested loop structure: loops 1 and 2 are the Y rows and X columns for each pixel in the image (36 million), loop 3 is the channels (R, G and B), and loops 4 and 5 are the rows and columns of the neighborhood.
for each HOME pixel
clear the R, G and B accumulators
for each neighbor,
add abs(home_red - nabr_red) to red_float_accumulator
same for green and blue
copy accumulated values to main memory
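As a scalar reference, the pseudocode above for one home pixel might look roughly like this in C, with the channel loop already moved innermost. It assumes an interleaved RGB layout (3 uint16 per pixel, row-major) and simply skips neighbors that fall outside the image; the name `contrast_at` and its parameters are made up for this sketch.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: sums |home - neighbor| per channel around pixel (x, y).
   Assumed layout: interleaved RGB, 3 uint16 per pixel, row-major. */
static void contrast_at(const uint16_t *img, int width, int height,
                        int x, int y, int depth, float out[3])
{
    float acc[3] = {0.0f, 0.0f, 0.0f}; /* clear the R, G and B accumulators */
    const uint16_t *home = img + 3 * ((size_t)y * width + x);

    for (int dy = -depth; dy <= depth; dy++) {
        for (int dx = -depth; dx <= depth; dx++) {
            int nx = x + dx, ny = y + dy;
            if (dx == 0 && dy == 0)
                continue; /* the home pixel is not its own neighbor */
            if (nx < 0 || ny < 0 || nx >= width || ny >= height)
                continue; /* assumption: neighbors off the image are skipped */
            const uint16_t *nabr = img + 3 * ((size_t)ny * width + nx);
            for (int c = 0; c < 3; c++)
                acc[c] += (float)abs((int)home[c] - (int)nabr[c]);
        }
    }

    /* copy accumulated values out (to the float image in main memory) */
    out[0] = acc[0];
    out[1] = acc[1];
    out[2] = acc[2];
}
```

Taking the absolute difference as an integer first means each neighbor needs one int -> float conversion per channel instead of two.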
My next step was to move the channels to level 5 and do all 3 subtractions, R, G and B, simultaneously with SIMD. With 48 bits/pixel and 128 bits available per SSE (XMM) register, 2 pixels can be done at once instead of just 1.
With the 256-bit YMM registers of AVX2 on the Skylake Xeon, 5 pixels could be done at once (10 would need the 512-bit registers of AVX-512, which the E3-1275 v5 does not support). I am looking for a good strategy to balance complexity with performance and to learn more about these vector operations.
I need R, G and B accumulators for each "home" pixel. The accumulated RGB values then go into a "float image" with the same XY resolution as the RAW RGB file (uint16 per channel), and the same contrast calculation is repeated for every pixel.