
I'm working on a problem using a 2D prefix sum, also called a Summed-Area Table S. For a 2D array I (grayscale image/matrix/etc.), its definition is:

S[x][y] = S[x-1][y] + S[x][y-1] - S[x-1][y-1] + I[x][y]
Sqr[x][y] = Sqr[x-1][y] + Sqr[x][y-1] - Sqr[x-1][y-1] + I[x][y]^2

Calculating the sum of a sub-matrix with two corners (top,left) and (bot,right) can be done in O(1):

sum = S[bot][right] - S[bot][left-1] - S[top-1][right] + S[top-1][left-1]
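For concreteness, the build recurrence and the O(1) query can be sketched as follows (hypothetical helper names `buildSAT`/`rectSum`; a one-cell zero border avoids the `-1` edge cases at the boundaries):

```cpp
#include <vector>

// Build a summed-area table with a one-cell zero border so the
// x-1 / y-1 lookups need no edge special-casing.
std::vector<long long> buildSAT(const std::vector<int>& I, int h, int w) {
    std::vector<long long> S((h + 1) * (w + 1), 0);
    for (int x = 1; x <= h; x++)
        for (int y = 1; y <= w; y++)
            S[x * (w + 1) + y] = S[(x - 1) * (w + 1) + y]
                               + S[x * (w + 1) + y - 1]
                               - S[(x - 1) * (w + 1) + y - 1]
                               + I[(x - 1) * w + (y - 1)];
    return S;
}

// Sum of I over rows [top, bot] and cols [left, right], 0-based inclusive.
long long rectSum(const std::vector<long long>& S, int w,
                  int top, int left, int bot, int right) {
    const int W = w + 1;
    return S[(bot + 1) * W + (right + 1)] - S[(bot + 1) * W + left]
         - S[top * W + (right + 1)] + S[top * W + left];
}
```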

One of my problems is to calculate all possible sub-matrix sums of a constant size (bot-top == right-left == R), which are then used to calculate their mean/variance. I've vectorized it into the form below.

lineSize is the number of elements to be processed at once. I chose lineSize = 16 because Intel AVX instructions can work on 4 doubles at the same time, so 16 elements fill several vectors. It can be 8/16/32/...

#define cell(i, j, w) ((i)*(w) + (j))
const int lineSize = 16; 
const int R = 3; // any integer
const int submatArea = (R+1)*(R+1);
const double submatAreaInv = double(1) / submatArea;
void subMatrixVarMulti(int64* S, int64* Sqr, int top, int left, int bot, int right, int w, int h, int diff, double submatAreaInv, double mean[lineSize], double var[lineSize])
{
  const int indexCache = cell(top, left, w),
        indexTopLeft = cell(top - 1, left - 1, w),
        indexTopRight = cell(top - 1, right, w),
        indexBotLeft = cell(bot, left - 1, w),
        indexBotRight = cell(bot, right, w);
  
  for (int i = 0; i < lineSize; i++) {
    mean[i] = (S[indexBotRight + i] - S[indexBotLeft + i] - S[indexTopRight + i] + S[indexTopLeft + i]) * submatAreaInv;
    var[i] = (Sqr[indexBotRight + i] - Sqr[indexBotLeft + i] - Sqr[indexTopRight + i] + Sqr[indexTopLeft + i]) * submatAreaInv
           - mean[i] * mean[i];
  }
}

How can I optimize the above loop to have the highest possible speed? Readability doesn't matter. I heard it can be done using AVX2 and intrinsic functions, but I don't know how.

Edit: the CPU is an i7-7700HQ (Kaby Lake, same microarchitecture family as Skylake)

Edit 2: forgot to mention that lineSize, R, ... are already const

Huy Le
    `int64`? AVX2 does have 64-bit integer subtract, but you'd have to synthesize 64-bit integer multiply out of 32-bit SIMD operations. This might still end up being worth it because of the amount of SIMD addition work you can do, but see [Fastest way to multiply an array of int64\_t?](https://stackoverflow.com/q/37296289) – Peter Cordes Jul 30 '20 at 11:39
    Oops, you're converting `int64_t` to `double` before multiply, so actually you need [How to efficiently perform double/int64 conversions with SSE/AVX?](https://stackoverflow.com/q/41144668) (No direct HW support for it until AVX-512; emulation can be efficient for numbers in the +-2^51 range with AVX2). int32_t would be more efficient, if your data could fit in that without overflow. Also `float` instead of `double` would save memory bandwidth... – Peter Cordes Jul 30 '20 at 11:50
  • Unfortunately, squared values mean that it will overflow for 4K image and above. Not to mention float has terrible precision on anything larger than 2^23. @PeterCordes – Huy Le Jul 30 '20 at 11:56
  • Oh I see, right you're subtracting two products, so catastrophic cancellation means you need to keep lots of precision to have any significant bits left. It's not obvious why 4k resolution would mean larger numbers in `S[]`, but presumably that comes from something in your use-case. You could speed it up significantly for smaller images by using narrower integers in cases where that is safe, though, e.g. having 2 versions of your function. (Twice as many elements per SIMD vector, and much simpler conversion to/from `double`.) – Peter Cordes Jul 30 '20 at 12:05
  • Sqr[] contains squared values, so int only support images with 2^15 pixels or less. Can you write a comment showing how to do the addition/subtract part only? I will try the multiply part later. If possible, can you write the SIMD-ed version in the case everything is 32 bit int/float ? – Huy Le Jul 30 '20 at 12:16
  • Is `Sqr[i] = S[i]*S[i]`? If so, you might want to save memory bandwidth by computing that on the fly. (e.g. convert int32 to double first and do FP subtraction for the variance line). I don't think it would be useful to vectorize the subtraction part without also vectorizing the conversion to `double`; extracting the elements of a SIMD vector costs instructions. Although possibly there'd still be a small gain? – Peter Cordes Jul 30 '20 at 12:23
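The limited-range int64-to-double trick linked in the comments above can be illustrated in scalar form; this is a sketch valid only for values within roughly ±2^51 (the AVX2 version does the same thing on four lanes with `_mm256_add_epi64` on the bit patterns followed by `_mm256_sub_pd`):

```cpp
#include <cstdint>
#include <cstring>

// "Magic number" conversion sketch, valid only for |x| < 2^51.
// Adding x into the mantissa bits of the double 2^52 + 2^51 shifts its
// value by exactly x (the ULP is 1 in that range), so an FP subtraction
// of the magic constant recovers x as a double.
double i64_to_f64_fast(int64_t x) {
    const double magic = 6755399441055744.0;   // 2^52 + 2^51
    uint64_t bits;
    std::memcpy(&bits, &magic, sizeof bits);
    bits += static_cast<uint64_t>(x);          // integer add into the mantissa
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d - magic;
}
```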

2 Answers


Your compiler can generate AVX/AVX2/AVX-512 instructions for you, but you need to:

  1. Select the latest available architecture when compiling. For example with GCC you might say -march=skylake if you know your code will run on Skylake and later, but does not need to support older CPUs. Without this, AVX instructions cannot be generated.
  2. Add restrict or __restrict to your pointer inputs to tell the compiler they do not overlap. This applies to S and Sqr, as well as mean and var (both pairs have the same type, so the compiler assumes they might overlap, but you know they do not).
  3. Make sure your data is "over-aligned." For example if you want the compiler to use 256-bit AVX2 instructions, you should align your arrays to 256 bits. There are a few ways to do this, such as making a typedef with the alignment, or using alignas() or std::assume_aligned() (available as a GCC attribute prior to C++20). The point is you need the compiler to know that S, Sqr, mean and var are aligned to the largest SIMD vector size available on your target architecture, so that it does not have to generate as much fixup code.
  4. Use constexpr where possible, such as lineSize.

Most importantly, profile to compare performance as you make changes, and look at the generated code (e.g. g++ -S) to see if it looks the way you want it to.
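A sketch of how points 1–4 might look applied to the loop in the question (parameter names shortened; `__restrict` is the GCC/Clang/MSVC spelling, and the caller is assumed to declare the outputs as e.g. `alignas(64) double mean[lineSize]`; compile with e.g. `-O3 -march=skylake`):

```cpp
#include <cstdint>

constexpr int lineSize = 16;

// Index parameters are the precomputed corner offsets from the question
// (top-left, top-right, bottom-left, bottom-right). With __restrict the
// compiler knows mean/var don't alias S/Sqr or each other, so it is free
// to vectorize the whole body.
void subMatrixVarMulti(const int64_t* __restrict S,
                       const int64_t* __restrict Sqr,
                       int idxTL, int idxTR, int idxBL, int idxBR,
                       double submatAreaInv,
                       double* __restrict mean,
                       double* __restrict var)
{
    for (int i = 0; i < lineSize; i++) {
        const double s = double(S[idxBR + i] - S[idxBL + i] - S[idxTR + i] + S[idxTL + i]);
        const double q = double(Sqr[idxBR + i] - Sqr[idxBL + i] - Sqr[idxTR + i] + Sqr[idxTL + i]);
        mean[i] = s * submatAreaInv;
        var[i]  = q * submatAreaInv - mean[i] * mean[i];
    }
}
```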

John Zwinck
  • Note that `restrict` is not a thing in standard C++. – eerorika Jul 30 '20 at 11:51
    Splitting loops sounds like a terrible idea here. (Old version of this answer had that as point 5). You usually want *more* computational intensity (amount of ALU work for each time you load the data), not less. `mean[i+0..3]` will already be in a vector register, ready to be multiplied in parallel with doing the independent subtraction work! Fissioning these loops would cost more total instructions, and more cache traffic. (Or far worse, memory or L3 traffic if the arrays are large and don't fit in L2 cache.) – Peter Cordes Jul 30 '20 at 11:52
  • @PeterCordes: thanks for the feedback, I deleted what was point 5, about splitting the loop. – John Zwinck Jul 30 '20 at 11:54
    You might need to manually vectorize this if you don't have AVX512, because packed int64_t <-> double is hard with only AVX2. And the most efficient way to emulate it only works for a limited range. (Which a compiler couldn't assume, so even if it did vectorize, it might be slower than what you could do by making that assumption) [How to efficiently perform double/int64 conversions with SSE/AVX?](https://stackoverflow.com/q/41144668). If you're lucky, you might get auto-vectorization that's close to full speed. – Peter Cordes Jul 30 '20 at 11:57
    With the probably-unaligned `indexTopLeft` and so on, aligning the arrays probably doesn't matter. AVX doesn't require aligned vectors, and GCC8 and later, like clang, will normally auto-vectorize with unaligned SIMD loads instead of trying to reach an alignment boundary. (Aligned data is nice; it avoids cache-line splits in loads and stores, so that code-gen strategy is "optimistic": no overhead wasted on checking in the aligned case, and running full speed then. But if misaligned, let HW handle it fairly efficiently, which all AVX-capable hardware does.) – Peter Cordes Jul 30 '20 at 12:00
  • @JohnZwinck my array has size "lineSize", which equals to 512-bit if lineSize = 8. Is that considered aligned ? – Huy Le Jul 30 '20 at 12:04
  • @HuyĐứcLê: An array of eight 8-byte numbers by default will only be aligned to 8 bytes. What I'm saying is you can tell the compiler to align an array to a full 64 bytes so the entire array starts at an address which is a multiple of 64 instead of just a multiple of 8. This probably doesn't matter for S and Sqr for the reasons Peter said, but could still be worthwhile for mean and var. – John Zwinck Jul 30 '20 at 12:15

I don't think you can perform this type of sum efficiently using SIMD, due to the dependencies in the summation.

Instead you can do the computation differently which can be trivially optimized with SIMD:

  1. Compute row-only partial sums. You can parallelize this with SIMD by processing several rows simultaneously.
  2. Then compute column-only partial sums over those row sums, using the same SIMD optimization, which yields the desired Summed-Area Table.

You can do the same for both summation and summation of squares.

The only issue is that you need extra memory, and this type of computation requires more memory accesses. The extra memory is probably a minor thing, but the extra memory accesses can perhaps be reduced by storing the temporary data (the row sums) in a cache-friendly layout. You'll probably need to experiment with this.
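A minimal sketch of the two-pass idea (hypothetical helper name; the column pass's inner loop walks contiguous memory with no cross-iteration dependency, which is what lets the compiler auto-vectorize it):

```cpp
// Two-pass summed-area table: a row pass, then a column pass.
void buildSAT2Pass(const int* I, long long* S, int h, int w) {
    // Pass 1: prefix sums along each row (serial within a row, but rows
    // are independent, so several rows can be processed in parallel).
    for (int x = 0; x < h; x++) {
        long long run = 0;
        for (int y = 0; y < w; y++) {
            run += I[x * w + y];
            S[x * w + y] = run;
        }
    }
    // Pass 2: accumulate down the columns. The inner loop over y has no
    // dependency between iterations, so it vectorizes well (several
    // columns per SIMD vector).
    for (int x = 1; x < h; x++)
        for (int y = 0; y < w; y++)
            S[x * w + y] += S[(x - 1) * w + y];
}
```

The same two passes apply unchanged to the table of squared values.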

ALX23z
  • This is actually my 7th improvement, after even multithreading. And I found that memory access is the overwhelming bottleneck. Also, how can I use SIMD for multiple rows at the same time? Each value would be vertical, very far away in memory. – Huy Le Jul 30 '20 at 12:22
  • @HuyĐứcLê the memory issue is platform dependent. On a common laptop a single thread is capable of saturating memory, but on a workstation you'll need quite a few threads to do so. – ALX23z Jul 30 '20 at 12:25
  • @HuyĐứcLê in the case of this algorithm, you should be aware that a cache line is 64 bytes. So whenever you load something, the 64 bytes around it are brought into cache. Thus I believe that if rows are contiguous you can load several at the same time, then compute efficiently on the whole cache line. – ALX23z Jul 30 '20 at 12:28
  • @HuyĐứcLê for the temporary table you probably should store in format with mixed rows-cols data... think of something efficient for storing/loading using simd for both cols and rows (as long as it is done with correct alignment). – ALX23z Jul 30 '20 at 12:33
  • @ALX23z I will try the idea. Also, I already use multithreading, which definitely saturates memory bandwidth. – Huy Le Jul 30 '20 at 12:43