
I have implemented a separable Gaussian blur. The horizontal pass was relatively easy to optimize with SIMD processing. However, I am not sure how to optimize the vertical pass.

Accessing elements column-wise is not very cache friendly, and filling a SIMD lane would mean reading many different pixels. I was thinking about transposing the image, running the horizontal pass, and then transposing the image back; however, I am not sure it would gain any improvement because of the two transpose operations.

I have quite large images (16k resolution) and a kernel size of 19, so the gain from vectorizing the vertical pass was only about 15%.

My vertical pass is as follows (it is inside a generic class templated on T, which can be uint8_t or float):

int yStart = kernelHalfSize;
int xStart = kernelHalfSize;
int yEnd = input.GetHeight() - kernelHalfSize;
int xEnd = input.GetWidth() - kernelHalfSize;

const T * inData = input.GetData().data();
V * outData = output.GetData().data();

int kn = kernelHalfSize * 2 + 1;
int kn4 = kn - kn % 4;

for (int y = yStart; y < yEnd; y++)
{
    size_t yW = size_t(y) * output.GetWidth();
    size_t outX = size_t(xStart) + yW;

    size_t xEndSimd = xStart;

    int len = xEnd - xStart;
    len = len - len % 4;
    xEndSimd = xStart + len;

    for (int x = xStart; x < xEndSimd; x += 4)
    {
        size_t inYW = size_t(y) * input.GetWidth();
        size_t x0 = ((x + 0) - kernelHalfSize) + inYW;
        size_t x1 = x0 + 1;
        size_t x2 = x0 + 2;
        size_t x3 = x0 + 3;

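        //sumDot accumulates 4 dot products at once, one per output pixel x+0 .. x+3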
        __m128 sumDot = _mm_setzero_ps();

        int i = 0;
        for (; i < kn4; i += 4)
        {               
            __m128 kx = _mm_set_ps1(kernelDataX[i + 0]);
            __m128 ky = _mm_set_ps1(kernelDataX[i + 1]);
            __m128 kz = _mm_set_ps1(kernelDataX[i + 2]);
            __m128 kw = _mm_set_ps1(kernelDataX[i + 3]);

            
            __m128 dx, dy, dz, dw;

            if constexpr (std::is_same<T, uint8_t>::value)
            {
                //we need to convert uint8_t inputs to float
                __m128i u8_0 = _mm_loadu_si128((const __m128i*)(inData + x0));
                __m128i u8_1 = _mm_loadu_si128((const __m128i*)(inData + x1));
                __m128i u8_2 = _mm_loadu_si128((const __m128i*)(inData + x2));
                __m128i u8_3 = _mm_loadu_si128((const __m128i*)(inData + x3));

                __m128i u32_0 = _mm_unpacklo_epi16(
                    _mm_unpacklo_epi8(u8_0, _mm_setzero_si128()),
                    _mm_setzero_si128());
                __m128i u32_1 = _mm_unpacklo_epi16(
                    _mm_unpacklo_epi8(u8_1, _mm_setzero_si128()),
                    _mm_setzero_si128());
                __m128i u32_2 = _mm_unpacklo_epi16(
                    _mm_unpacklo_epi8(u8_2, _mm_setzero_si128()),
                    _mm_setzero_si128());
                __m128i u32_3 = _mm_unpacklo_epi16(
                    _mm_unpacklo_epi8(u8_3, _mm_setzero_si128()),
                    _mm_setzero_si128());

                dx = _mm_cvtepi32_ps(u32_0);
                dy = _mm_cvtepi32_ps(u32_1);
                dz = _mm_cvtepi32_ps(u32_2);
                dw = _mm_cvtepi32_ps(u32_3);

            }
            else
            {
                /*
                //load 8 consecutive values
                auto dd = _mm256_loadu_ps(inData + x0);

                //extract parts by shifting and casting to 4 values float
                dx = _mm256_castps256_ps128(dd);
                dy = _mm256_castps256_ps128(_mm256_permutevar8x32_ps(dd, _mm256_set_epi32(0, 0, 0, 0, 4, 3, 2, 1)));
                dz = _mm256_castps256_ps128(_mm256_permutevar8x32_ps(dd, _mm256_set_epi32(0, 0, 0, 0, 5, 4, 3, 2)));
                dw = _mm256_castps256_ps128(_mm256_permutevar8x32_ps(dd, _mm256_set_epi32(0, 0, 0, 0, 6, 5, 4, 3)));
                */

                dx = _mm_loadu_ps(inData + x0);
                dy = _mm_loadu_ps(inData + x1);
                dz = _mm_loadu_ps(inData + x2);
                dw = _mm_loadu_ps(inData + x3);
            }

            //calculate 4 dots at once
            //[dx, dy, dz, dw] <dot> [kx, ky, kz, kw]

            auto mx = _mm_mul_ps(dx, kx); //dx * kx
            auto my = _mm_fmadd_ps(dy, ky, mx); //mx + dy * ky
            auto mz = _mm_fmadd_ps(dz, kz, my); //my + dz * kz
            auto res = _mm_fmadd_ps(dw, kw, mz); //mz + dw * kw

            sumDot = _mm_add_ps(sumDot, res);

            x0 += 4;
            x1 += 4;
            x2 += 4;
            x3 += 4;
        }

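        //remaining (kn % 4) kernel taps, processed one tap at a time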
        for (; i < kn; i++)
        {               
            auto v = _mm_set_ps1(kernelDataX[i]);
            auto v2 = _mm_set_ps(
                *(inData + x3), *(inData + x2), 
                *(inData + x1), *(inData + x0)
            );
            
            sumDot = _mm_add_ps(sumDot, _mm_mul_ps(v, v2));

            x0++;
            x1++;
            x2++;
            x3++;
        }

        sumDot = _mm_mul_ps(sumDot, _mm_set_ps1(weightX));

        if constexpr (std::is_same<V, uint8_t>::value)
        {
            __m128i asInt = _mm_cvtps_epi32(sumDot);
            
            asInt = _mm_packus_epi32(asInt, asInt);
            asInt = _mm_packus_epi16(asInt, asInt);

            uint32_t res = _mm_cvtsi128_si32(asInt);

            ((uint32_t *)(outData + outX))[0] = res;                
            outX += 4;
        }
        else 
        {
            alignas(16) float tmpRes[4];
            _mm_store_ps(tmpRes, sumDot);

            outData[outX + 0] = tmpRes[0];
            outData[outX + 1] = tmpRes[1];
            outData[outX + 2] = tmpRes[2];
            outData[outX + 3] = tmpRes[3];
            outX += 4;
        }
        
    }   

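    //scalar fallback for the last few pixels of the row that did not fit the SIMD loop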
    for (int x = xEndSimd; x < xEnd; x++)
    {
        int kn = kernelHalfSize * 2 + 1;
        const T * v = input.GetPixelStart(x - kernelHalfSize, y);
        float tmp = 0;
        for (int i = 0; i < kn; i++)
        {
            tmp += kernelDataX[i] * v[i];
        }
        tmp *= weightX;
        outData[outX] = ImageUtils::clamp_cast<V>(tmp);
        outX++;
    }
}
• Show us what you got so far! It would also be very important to know what input/output type you have, what target architecture you want to support, whether you need the exact same output or an approximation suffices, etc. Was the 15% gain in total, or just on the vertical pass? Did you check if, e.g., OpenCV already does some SIMD optimization in its Gaussian blur implementation? – chtz Jun 28 '20 at 11:02
• There are a number of optimisations you can use on the vertical pass - two in particular are: (a) process more than one vector per row to get a better cache footprint and (b) make use of the redundant loads for all the overlapping consecutive rows, e.g. operate on 22 rows simultaneously to generate 4 output rows (reduces loads by around 4x). – Paul R Jun 28 '20 at 11:41
  • @chtz I have added code for my vertical pass. – Martin Perry Jun 28 '20 at 12:06
• `//we need to convert uint8_t inputs to float`: You may instead try a `PMADDUBSW`-based solution (i.e., do everything using 8+8 bit fixed point numbers). You'll be off by a bit compared to your `float` based solution. Are you limited to SSE2? – chtz Jun 28 '20 at 13:14
  • @chtz I can use avx2 if necessary – Martin Perry Jun 28 '20 at 13:50

1 Answer


There’s a well-known trick for that.

In both passes, read the input sequentially and use SIMD to compute, but write the result out into another buffer, transposed, using scalar stores. Protip: SSE 4.1 has _mm_extract_ps for that; just don’t forget to cast your destination image pointer from float* to int*. Another thing about these stores: I would recommend _mm_stream_si32, as you want maximum cache space used by your input data. When you compute the second pass, you’ll be reading sequential memory addresses again, and the hardware prefetcher will deal with the latency.

This way both passes are identical; I usually call the same function twice, with different buffers.

The two transposes caused by your two passes cancel each other out. Here’s an HLSL version, BTW.
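
For illustration, here is a minimal sketch of that store pattern with SSE intrinsics. The function name, the transposed buffer dstT and the stride H are placeholders, not code from the question: the pass computes 4 results for pixels (x..x+3, y) of a W x H image in one __m128, and each lane is scattered into a buffer where source pixel (x, y) lands at index x*H + y, so the next pass can again read its input sequentially.

#include <immintrin.h>
#include <cstddef>

//res holds the convolved values of pixels (x+0 .. x+3, y) of a W x H image.
//dstT is the transposed destination (H floats per row, W rows), so source
//pixel (x, y) is written to dstT[x * H + y].
inline void StoreTransposed4(__m128 res, float* dstT, std::size_t x, std::size_t y, std::size_t H)
{
    //_mm_extract_ps returns the raw bit pattern as an int, so store it through
    //an int* to avoid an unwanted int -> float conversion (see the comments below)
    int* p = reinterpret_cast<int*>(dstT);

    //four scalar, strided stores, one per lane; _mm_stream_si32 is the
    //non-temporal store suggested above (see the caveats in the comments below)
    _mm_stream_si32(p + (x + 0) * H + y, _mm_extract_ps(res, 0));
    _mm_stream_si32(p + (x + 1) * H + y, _mm_extract_ps(res, 1));
    _mm_stream_si32(p + (x + 2) * H + y, _mm_extract_ps(res, 2));
    _mm_stream_si32(p + (x + 3) * H + y, _mm_extract_ps(res, 3));
}

If the non-temporal stores turn out not to help (see the discussion in the comments), plain stores through the same int pointer keep the layout identical.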

There’s more. If your kernel size is only 19, it fits in 3 AVX registers (24 floats). I think shuffle/permute/blend instructions are still faster than even L1 cache loads, i.e. it might be better to load the kernel into registers outside the loop.
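
Here is a minimal sketch of that last point, assuming the taps live in a plain float array like kernelDataX from the question; the helper name and the padded copy are placeholders:

#include <immintrin.h>
#include <algorithm>

//load a 19-tap kernel into three __m256 registers once, before the convolution
//loops; the 5 padding taps stay zero, so their multiply-adds contribute nothing
inline void LoadKernel19(const float* kernelDataX, __m256& k0, __m256& k1, __m256& k2)
{
    alignas(32) float padded[24] = {};
    std::copy(kernelDataX, kernelDataX + 19, padded);

    k0 = _mm256_load_ps(padded + 0);    //taps 0..7
    k1 = _mm256_load_ps(padded + 8);    //taps 8..15
    k2 = _mm256_load_ps(padded + 16);   //taps 16..18 plus zero padding
}

Inside the inner loop, the individual taps can then be broadcast from k0..k2 with permutes instead of being reloaded from memory every iteration.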

  • Careful with `_mm_extract_ps` - it *can* compile to a memory-destination store, but note that the intrinsic return value has type `int`, so you need to type-pun it back to `float` to actually store an FP bit-pattern. Otherwise assigning `int` to `float` will take the FP bit-pattern as an int, *convert it to float*, and then store. Like `extractps eax, xmm0, 1` / `cvtsi2ss xmm1, eax` / `movss [mem], xmm1` /facepalm – Peter Cordes Jun 30 '20 at 20:53
  • Not sure it’s relevant for OP, they need it in memory anyway, casting pointers is easy. For that use case I would even recommend `_mm_stream_si32` for that store, you’ll only be reading way later but you want all available cache space for the source image and especially for the convolution kernel. – Soonts Jun 30 '20 at 21:34
  • I'm just saying that `_mm_extract_ps` is hard to use; just mentioning it in passing without mentioning the gotchas is maybe not super helpful. Also, `_mm_stream_si32` would lead to extracting to an integer reg and then `movnti` on that. If you want NT stores, you'd want to do FP shuffles to combine 4 elements for `movntps`. Or if you aren't doing full-line stores, you don't actually want NT at all; let L1d absorb your scattered writes. – Peter Cordes Jun 30 '20 at 21:42
  • @PeterCordes Mentioned. It’s very complicated to do full-line stores when writing transposed output, you’ll have to do 4-8 lines at the same time in the inner loop. Doable, but the overhead of these shuffles of outputs / broadcasts of kernel eats most of the performance gained. In my experience, the hardware is pretty good at random writes as long as the written memory ain’t read back too soon. – Soonts Jun 30 '20 at 21:53
  • Then you don't want NT stores; there are only about a dozen cache-line-sized buffers (and regular loads / stores compete for them too, to transfer data from L2 cache). They have to evict all the way to RAM whether they're full or not, if a new one is needed. A full-line store can avoid an RFO (and just invalidate), but a partial-line NT store is less efficient; I don't know the details of how multi-core CPUs handle it but it's something you want to avoid. – Peter Cordes Jun 30 '20 at 22:14
  • @PeterCordes Not sure about that. I think CPU does what the documentation says, stores it bypassing caches. If they could only store 16-bytes at a time (dual-channel DDR bus width) or 64 bytes at a time (cache line), and only through caches, I don’t think memory mapped IO would work, as it does in most device drivers. – Soonts Jun 30 '20 at 22:34
  • It's supported for correctness of course, but it's not efficient AFAIK. WC memory and NT stores exist for write *combining*. e.g. [Why wasn't MASKMOVDQU extended to 256-bit and 512-bit stores?](https://stackoverflow.com/q/54931225) discusses some of the possible issues. http://download.intel.com/design/PentiumII/applnots/24442201.pdf from 1998 mentions that only a full-line write can commit fully efficiently. It doesn't sound too bad there, working with 8-byte granularity, but it could be worse with DDR4 which doesn't guarantee byte enable signals. So it might RMW in the mem controller – Peter Cordes Jun 30 '20 at 23:03
  • https://patents.google.com/patent/US20010013870 (Intel 1998) talks about partial WC evictions being "expensive" (and a mechanism to maybe handle that better). But I think they mean expensive compared to the ideal case where a full-line burst is possible, if stores to a full line are somewhat nearby in time just mixed with other ops. (So maybe not like your case.) http://blog.andy.glew.ca/2011/06/write-combining.html also talks a bit about strategies for handling eviction of not-full WC buffers. It's possible that it's not as bad as I think it is, but it does defeat L3 cache from helping. – Peter Cordes Jun 30 '20 at 23:17
  • @PeterCordes “mentions that only a full-line write can commit fully efficiently” true, but I’m pretty sure by “fully efficiently” they mean “saturate write bandwidth”. It matters for e.g. memcpy, not for convolution. Even extremely optimized code with the complete kernel in vector registers is going to read at least an order of magnitude more from RAM than it writes there. – Soonts Jun 30 '20 at 23:36
  • Ok sure, but that doesn't mean NT stores are necessarily a good idea. Worth trying, but slow NT stores that can't benefit from HW prefetch could fill the store buffer and stall progress worse than the alternative. Some extra cache pollution can lead to sometimes pulling data from farther away, but with good HW prefetch that might be mitigated. It probably becomes a latency vs. bandwidth tradeoff for different levels of cache. – Peter Cordes Jul 01 '20 at 00:16