Using SSE on floating point pixels with only 3 color components

Question

I am creating a struct to store a single RGB pixel in an image.

struct Pixel
{
    // color values range from 0.0 to 1.0
    float r, g, b;
} __attribute__((aligned(16)));

I want to use 128-bit SSE instructions to do things like adding, multiplying, etc. This way I can perform operations on all 3 color channels at once. So the first packed float in my SSE register would be red, then green, then blue, but I am not sure what would go into the fourth element. I really don't care what bits are in the extra 32 bits of padding. When I load a pixel into the SSE register, I would imagine it contains either zeros or junk values. Is this problematic? Should I add a fourth alpha channel even though I don't really need one? The only way I see this being an issue is if I were dividing by a pixel and there was a zero value in the fourth spot, or if I were taking the square root of a negative value, etc.

It's *technically* not strictly conforming, so tools like `valgrind` or AddressSanitizer may complain. I don't see any reason why this shouldn't work in practice though. – EOF, Oct 04 '15 at 14:45
@EOF: There are technical reasons with FP data: FP exceptions, and huge performance penalties on some CPUs if your uninitialized data happens to represent a NaN, infinity, or denormal float. – Peter Cordes, Oct 04 '15 at 16:05
Is there a reason you want to use such high precision for color channels? For storage, half-floats are often used instead; I'm not sure about processing. – phuclv, Oct 06 '15 at 02:58

score 9 · Accepted Answer · edited May 23 '17 at 12:29

Integer ops will have no problem at all with uninitialized values, since the latency is never data-dependent. Floating point is different. Some FPUs slow down on denormals, NaNs, and infinities (in any one of the vector elements).

Intel Nehalem and earlier slow down a lot when doing math ops with denormal inputs/outputs, and on FP underflow/overflow. Sandybridge has a nice FPU with fast add/sub for any inputs (according to Agner Fog's instruction tables), but multiply can still slow down.

Add/sub/multiply are fine with zeros, but potentially a problem with uninitialized junk that might represent NaN or something.

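Be careful with division that you aren't dividing by zero. That could even raise an FP exception if the divide-by-zero exception has been unmasked in MXCSR (it is masked by default, in which case you just get an infinity or NaN result and a status flag).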
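One way to keep the padding lane out of trouble when dividing is to force the divisor's fourth element to 1.0 before the divide. The following is only a sketch: the helper name `div_rgb` is hypothetical, and it assumes the three color lanes of the divisor are already known to be non-zero.

```c
#include <emmintrin.h>  /* SSE2; also provides the SSE float intrinsics */

/* Hypothetical helper: divide num by den lane-wise, but replace the
 * divisor's unused 4th lane with 1.0f so that lane never sees x/0 or 0/0.
 * The color lanes (0..2) are NOT guarded; that is assumed to be the
 * caller's job. */
static inline __m128 div_rgb(__m128 num, __m128 den)
{
    /* all-ones in lanes 0..2, zero in lane 3 */
    const __m128 keep_rgb = _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1));
    /* 1.0f in lane 3 only */
    const __m128 one_pad = _mm_set_ps(1.0f, 0.0f, 0.0f, 0.0f);

    __m128 safe_den = _mm_or_ps(_mm_and_ps(den, keep_rgb), one_pad);
    return _mm_div_ps(num, safe_den);
}
```

With this, the fourth lane of the result is simply the fourth lane of `num`, so a zeroed padding element stays zero.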

So yes, keeping the unused element zeroed is probably a good idea. Depending how you generate things in the first place, this may be pretty cheap to accomplish. (e.g. movd/pinsrd/pinsrd (or insertps) to put three 32bit elements into a vector, with the initial movd zeroing the high 96b.)
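As a rough illustration of building a vector whose unused element is zero, here is a sketch in C intrinsics. It does not use the movd/pinsrd sequence mentioned above; instead it assumes the three floats sit contiguously in memory and uses plain SSE1 loads (movlps, movss, movlhps). The name `load_rgb0` is made up for this example.

```c
#include <xmmintrin.h>  /* SSE */

/* Hypothetical helper: load 3 contiguous floats (r, g, b) into an SSE
 * register as { r, g, b, 0 } without reading a 4th float from memory. */
static inline __m128 load_rgb0(const float *rgb)
{
    /* low 64 bits from memory, high 64 bits stay zero: { r, g, 0, 0 } */
    __m128 rg = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)rgb);
    /* scalar load zeroes the upper three lanes: { b, 0, 0, 0 } */
    __m128 b = _mm_load_ss(rgb + 2);
    /* low half of rg, then low half of b: { r, g, b, 0 } */
    return _mm_movelh_ps(rg, b);
}
```

Because it touches exactly 12 bytes, this also works for the last pixel of a tightly packed buffer.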

One workaround could be to store a 2nd copy of the blue channel in the 4th element (or whatever is most convenient to shuffle there). You could load vectors with movsldup (SSE3) / movlps. After movsldup, your register would hold { b b r r }. movlps would re-load the lower 64 bits, so you'd have { b b g r }. (This is equivalent to movlpd, BTW; a movsd load would zero the upper half instead of merging.) Or if the shuffle port is less busy than the load ports, do one 16B load and then shufps. (movsldup with a memory operand on Intel CPUs is a single uop that runs on a load port, even though it has the duplication built in.)
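The "one 16B load and then shufps" variant might look like the sketch below. It assumes the 16-byte-aligned `struct Pixel` from the question, so the load covers r, g, b plus the 4 padding bytes; `load_rgbb` is a hypothetical name. Reading the padding through a pointer to `r` is the not-strictly-conforming part raised in the comments on the question.

```c
#include <xmmintrin.h>  /* SSE */

struct Pixel
{
    float r, g, b;
} __attribute__((aligned(16)));  /* sizeof(struct Pixel) == 16 */

/* Hypothetical helper: load a padded pixel and overwrite whatever was in
 * the padding lane with a copy of blue, giving { r, g, b, b }. */
static inline __m128 load_rgbb(const struct Pixel *p)
{
    __m128 v = _mm_load_ps(&p->r);  /* { r, g, b, padding } */
    /* lanes 0,1 from v[0], v[1]; lanes 2,3 both from v[2] */
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 1, 0));
}
```

Since the shuffle only moves bits, junk in the padding never reaches an arithmetic instruction.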

Another option would be to pack your pixels into 12 bytes, so a 16B load would get one component of the next pixel. Depending on what you're doing, overlapping stores that clobber one element of the next pixel might or might not be ok. Loading the next pixel before storing the current could work around that for some ops. It's quite easy to be cache or bandwidth-limited, so saving 1/4 space at the small cost of the occasional cache-line split load/store could be worth it.
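A minimal sketch of the packed 12-byte layout with overlapping 16B loads and stores follows. The function name and the per-channel gain operation are assumptions chosen for illustration. Rather than loading the next pixel before storing, this particular sketch multiplies the overlapping lane by 1.0f so the store writes the next pixel's red back unchanged (for ordinary finite values), and it handles the final pixel with scalar code so no 16B access runs past the end of the buffer.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE */

/* Hypothetical helper: scale n packed RGB pixels (3 floats each, no
 * padding) in place by per-channel gains. */
static void scale_rgb_packed(float *px, size_t n, float gr, float gg, float gb)
{
    if (n == 0)
        return;

    /* lane 3 overlaps the next pixel's red: gain 1.0f leaves it as it was */
    const __m128 gain = _mm_set_ps(1.0f, gb, gg, gr);

    size_t i = 0;
    for (; i + 1 < n; ++i) {
        float *p = px + 3 * i;
        __m128 v = _mm_loadu_ps(p);             /* r, g, b, next r */
        _mm_storeu_ps(p, _mm_mul_ps(v, gain));  /* next r rewritten as-is */
    }

    /* last pixel: scalar, so nothing is read or written past the buffer */
    float *p = px + 3 * i;
    p[0] *= gr;
    p[1] *= gg;
    p[2] *= gb;
}
```

For operations where no neutral value exists for the fourth lane, the load-next-before-store approach or a padded buffer would be needed instead.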

answered Oct 04 '15 at 16:04 by Peter Cordes

Sandybridge didn't totally eliminate SSE/AVX subnormal stalls (they're still present for some cases in Broadwell), though it did eliminate them in some of the most common cases. Dividing by zero is safe on SSE/AVX; it will set a flag, but it will not trap in the default floating-point environment. The risk of packing and loading from the next pixel is that if you have a number of pixels that is not a multiple of four, the final load at the end of the buffer reads (possibly unmapped) data you don't control. (Apologies for packing a bunch of unrelated notes into a comment). – Stephen Canon Oct 04 '15 at 18:18
@stephen: Thanks for the info on subnormal slowdowns and div by zero. I didn't mention it, but I was thinking the best way to handle the write-past-the-end problem is to just allocate your buffer with padding. Or end your loop one iteration early, and handle the last triplet with `movlps / extractps`. Or use `palignr / movups` to do a 16B write that rewrites the high element of the previous pixel. (Since you still have it in a register when you break out of your loop.) I usually go for multiple topics in one comment instead of leaving 3 or 4 comments in a row, myself. :) – Peter Cordes Oct 04 '15 at 20:51
Agner Fog originally only tested denormals with addition/subtraction, which were fast on Sandy Bridge. Only after somebody [noticed it here](http://stackoverflow.com/questions/9314534/why-does-changing-0-1f-to-0-slow-down-performance-by-10x/9314926#comment11910304_9314926) did the story propagate all the way to Agner's blog, where he re-tested and discovered that multiplication is still vulnerable to denormals on Sandy Bridge. – Mysticial Oct 05 '15 at 14:26
Wouldn't an AoSoA be a better solution? I mean, e.g. for SSE, `struct PixelBlock4 { float r[4], g[4], b[4]; } __attribute__((aligned(16)));`. Of course this would require making significant changes to a code base, but if I were starting a new project I would consider this at the start. – Z boson Oct 07 '15 at 08:38
I did think about doing that. Having 3 arrays, one for red, one for green, one for blue. – chasep255 Oct 07 '15 at 10:56
@chasep255: Yes, Z boson is right that a planar RGB format would avoid having one wasted element in your computations. (Unless you pack pixels and unroll loops that do something different for each component by 3 or something.) A half-way format that puts all 3 components for nearby pixels in the same or nearby cache line is a nice idea. With AVX, you'd probably want the struct to use blocks of 8, or blocks of 16 for AVX512. – Peter Cordes Oct 07 '15 at 13:23
@chasep255, one thing with having an image in AoSoA is you need to convert it back to AoS to render in the end. That usually is not a problem because the calculations you are doing if you need SIMD should be much more intensive than the conversion. In any case you need to convert float to byte anyway and additionally the conversion is memory bandwidth bound. I'm not sure how to do the conversion in place so you need two buffers as well (AoSoA buffer and AoS buffer). You could probably do the conversion on the GPU. – Z boson Oct 08 '15 at 07:44
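The AoSoA layout discussed in these comments could be sketched roughly as follows for SSE. The struct mirrors the one Z boson wrote; the function `scale_blocks` and the gain operation are hypothetical and only there to show that every vector lane does useful work in this layout.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE */

/* Four pixels per block, one small array per channel. */
struct PixelBlock4
{
    float r[4], g[4], b[4];
} __attribute__((aligned(16)));

/* Hypothetical helper: scale every pixel by per-channel gains. Each 16B
 * vector holds one channel of four pixels, so there is no wasted lane. */
static void scale_blocks(struct PixelBlock4 *blk, size_t nblocks,
                         float gr, float gg, float gb)
{
    const __m128 vr = _mm_set1_ps(gr);
    const __m128 vg = _mm_set1_ps(gg);
    const __m128 vb = _mm_set1_ps(gb);

    for (size_t i = 0; i < nblocks; ++i) {
        _mm_store_ps(blk[i].r, _mm_mul_ps(_mm_load_ps(blk[i].r), vr));
        _mm_store_ps(blk[i].g, _mm_mul_ps(_mm_load_ps(blk[i].g), vg));
        _mm_store_ps(blk[i].b, _mm_mul_ps(_mm_load_ps(blk[i].b), vb));
    }
}
```

If the pixel count is not a multiple of four, the tail of the last block is padding; filling it with zeros or a copy of a real pixel keeps junk values out of the arithmetic.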
