
I'm hoping to speed up this matrix-vector product using AVX-1 or earlier instructions:

    // a is an array of N column pointers, each column of length M
    // b is length N
    // c is length M
    //
    // M % 32 == N % 32 == 0
    // all memory is nicely aligned and unaliased
    
    void mat_vec_prod(const int8_t** a, const uint8_t* b, int16_t* c) {
        for(int i = 0; i < M; ++i) {
            c[i] = 0;
            for(int j = 0; j < N; ++j)
                c[i] += int16_t(a[j][i]) * int16_t(b[j]);
        }
    }

(I'm aware that swapping the loops is worth considering)

The intrinsics _mm_maddubs_epi16 and _mm_maddubs_pi16 could help with the uint8 x int8 dot product, but, in my case, the matrix has an awkward layout: it's an array of pointers to columns (instead of rows).

One possibility is to load 8x8 patches of a, and then transpose and multiply them by segments of b. (I found this thread on 8x8 byte matrix transpose). However, this would have to use _mm_maddubs_pi16, which has only half the throughput of _mm_maddubs_epi16.
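For concreteness, here is a sketch of such an 8x8 byte transpose with SSE2 unpacks (a hypothetical helper with illustrative names, not the code from the linked thread): it loads 8 bytes from each of 8 columns and transposes them in three rounds of unpacks, so each output register ends up holding two transposed 8-byte rows.

    #include <emmintrin.h>  // SSE2
    #include <cstdint>

    // Transpose the 8x8 byte patch a[j..j+7][i..i+7].
    // r0..r7 hold column segments; out[0..3] each hold two transposed rows.
    static inline void transpose_8x8_i8(const int8_t** a, int j, int i,
                                        __m128i out[4]) {
        __m128i r0 = _mm_loadl_epi64((const __m128i*)(a[j + 0] + i));
        __m128i r1 = _mm_loadl_epi64((const __m128i*)(a[j + 1] + i));
        __m128i r2 = _mm_loadl_epi64((const __m128i*)(a[j + 2] + i));
        __m128i r3 = _mm_loadl_epi64((const __m128i*)(a[j + 3] + i));
        __m128i r4 = _mm_loadl_epi64((const __m128i*)(a[j + 4] + i));
        __m128i r5 = _mm_loadl_epi64((const __m128i*)(a[j + 5] + i));
        __m128i r6 = _mm_loadl_epi64((const __m128i*)(a[j + 6] + i));
        __m128i r7 = _mm_loadl_epi64((const __m128i*)(a[j + 7] + i));
        __m128i t0 = _mm_unpacklo_epi8(r0, r1);   // cols j/j+1 interleaved bytewise
        __m128i t1 = _mm_unpacklo_epi8(r2, r3);
        __m128i t2 = _mm_unpacklo_epi8(r4, r5);
        __m128i t3 = _mm_unpacklo_epi8(r6, r7);
        __m128i u0 = _mm_unpacklo_epi16(t0, t1);  // cols j..j+3, rows i..i+3
        __m128i u1 = _mm_unpackhi_epi16(t0, t1);  // cols j..j+3, rows i+4..i+7
        __m128i u2 = _mm_unpacklo_epi16(t2, t3);
        __m128i u3 = _mm_unpackhi_epi16(t2, t3);
        out[0] = _mm_unpacklo_epi32(u0, u2);      // rows i   and i+1
        out[1] = _mm_unpackhi_epi32(u0, u2);      // rows i+2 and i+3
        out[2] = _mm_unpacklo_epi32(u1, u3);      // rows i+4 and i+5
        out[3] = _mm_unpackhi_epi32(u1, u3);      // rows i+6 and i+7
    }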

My question is: is it worth trying to load and transpose 16x16 patches instead, or will I run out of xmm registers? What should my strategy be here?

MWB
  • Can you use AVX2 instructions? Some people mean AVX/AVX2 when they say AVX. – fuz Sep 28 '21 at 09:38
  • @fuz No, just AVX. Please edit, if there is a way to make this clearer. – MWB Sep 28 '21 at 09:39
  • Uh... optimising for Sandy Bridge or this one AMD generation is annoying. Let's see... – fuz Sep 28 '21 at 09:55
  • I would `[v]punpcklbw` two vectors of `a[j][i:i+7]` and `a[j+1][i:i+7]` and multiply that using `[v]pmaddubsw` by `{b[j], b[j+1], b[j], b[j+1], ...}` (then accumulate these results). Depending on the actual sizes you'll need some sort of blocking (i.e., don't reload the `c` subvector for every addition). Creating the `b`-multiplier is a bit annoying without `vpbroadcastw`; I guess a `[v]pshuflw` followed by a `[v]pshufd` should do the trick. You should reuse the `b`-multiplier for as many registers as you have for accumulating `c`. – chtz Sep 28 '21 at 13:11
  • *where it's an array of pointers* - Do you need to use a ragged array-of-pointers for some reason, like to make swapping whole columns with each other efficient? Using contiguous storage for the whole matrix doesn't change the fundamental difficulty of SIMD loads only being able to get multiple elements in one direction (rows in your case), but at least indexing between columns is just adding a stride, not loading a different pointer. – Peter Cordes Sep 28 '21 at 16:16
  • @PeterCordes *"Do you need to use a ragged array-of-pointers for some reason, like to make swapping whole columns with each other efficient?"* Yes. I need to use an array-of-pointers (or an equivalent "contiguous array + index into it") for efficiency reasons. I could *create* a contiguous array, before every function call, at the cost of copying `M x N` bytes out of `a`. – MWB Sep 28 '21 at 16:53
  • Obviously I was implying that you might or might not be able to change the other uses of this data structure to use a contiguous array. If you can't or won't, then making a copy just for this function's use is worse than the extra indirection as you traverse it. As I said, it wouldn't make anything fundamentally easier for SIMD. – Peter Cordes Sep 28 '21 at 17:49
  • One thing pops out: what is the _real_ range of the inputs? I'm just asking because if the number of rows is a multiple of 32, then the 16 bit accumulator overflows quite rapidly, unless the multiplier is something like -4..4, or even -2..2 for 64 rows (or if `a` is constrained). – Aki Suihkonen Sep 29 '21 at 05:49
  • @AkiSuihkonen In my case, the input is constrained (in a complicated way) such that an overflow in the result or intermediate values should not occur. – MWB Sep 29 '21 at 06:16
  • @chtz I implemented your suggestion (at least as I understood it). 6x faster than the naive version, probably pushing against the memory bandwidth. Many thanks! – MWB Sep 29 '21 at 21:06

1 Answer


I'd go with the interleaving approach suggested by chtz.

Read 32 or 64 bytes (the latter being a full cache line) from each of two rows, then interleave them.

At least 32 bytes, since each row's width is a multiple of 32; preferably 64 bytes, since that is a full cache line and accumulating its products takes 8 of the 16 xmm registers.

Also, I would guess that processing the input as blocks of (8, 16, or 32 rows) by (32 or 64 columns) would be better than processing all the rows at once: the more rows you process per block, the less often you need to spill the accumulators to memory; but the more rows you process in non-linear order, the higher the probability of evicting soon-to-be-needed lines from the cache. 4 rows should definitely be on the safe side.
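A structure-only sketch of that blocking might look as follows (the names and the block size JB are my illustrative choices; M and N are compile-time constants as in the question, and M % 64 == 0 is assumed here for brevity). The inner loop body is the interleave/multiply shown below.

    #include <emmintrin.h>
    #include <cstdint>

    constexpr int JB = 16;  // rows of `a` per block: larger JB = fewer reloads of c

    void mat_vec_prod_blocked(const int8_t** a, const uint8_t* b, int16_t* c) {
        for (int j0 = 0; j0 < N; j0 += JB) {       // block of JB rows of `a`
            for (int i = 0; i < M; i += 64) {      // 64-byte strip of each row
                __m128i acc[8];                    // partial sums for c[i..i+63]
                for (int k = 0; k < 8; ++k)
                    acc[k] = (j0 == 0) ? _mm_setzero_si128()
                                       : _mm_load_si128((const __m128i*)(c + i + 8 * k));
                for (int j = j0; j < j0 + JB; j += 2) {
                    // interleave a[j], a[j+1] and _mm_maddubs_epi16 into acc[],
                    // as in the snippets below
                }
                for (int k = 0; k < 8; ++k)        // spill once per block
                    _mm_store_si128((__m128i*)(c + i + 8 * k), acc[k]);
            }
        }
    }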

Interleaving b is quite naturally done by

    // `b` here is an __m128i holding 16 consecutive bytes of the vector b
    auto b0to7 = _mm_unpacklo_epi16(b, b);      // {b0,b1},{b0,b1},{b2,b3},{b2,b3},...
    auto b8tof = _mm_unpackhi_epi16(b, b);      // same for bytes 8..15
    auto b01 = _mm_shuffle_epi32(b0to7, 0x00);  // byte pair {b0,b1} in every word lane
    auto b23 = _mm_shuffle_epi32(b0to7, 0x55);  // byte pair {b2,b3} in every word lane
    ...
    auto bef = _mm_shuffle_epi32(b8tof, 0xff);  // byte pair {b14,b15} in every word lane

Another possibility, splitting the input into even/odd sequences, would need 4 arithmetic instructions per 16 input bytes, or 8 instructions per 32 bytes:

    // common terms, hoisted out of the inner loop (one row j at a time)
    auto b_even = _mm_set1_epi16(b[j] & 0x00ff);       // byte pairs {b[j], 0}
    auto b_odd  = _mm_set1_epi16(int16_t(b[j] << 8));  // byte pairs {0, b[j]}
    // per 16 input bytes in `a`; the unsigned operand of pmaddubsw must come first
    auto mul_even = _mm_maddubs_epi16(b_even, row_j);  // b[j] * even elements
    auto mul_odd  = _mm_maddubs_epi16(b_odd, row_j);   // b[j] * odd elements
    sum_even = _mm_add_epi16(sum_even, mul_even);
    sum_odd  = _mm_add_epi16(sum_odd, mul_odd);

This is clearly not as tight as

    auto prod_lo = _mm_unpacklo_epi8(row_j, row_jplus1);  // interleave the two rows
    auto prod_hi = _mm_unpackhi_epi8(row_j, row_jplus1);
    prod_lo = _mm_maddubs_epi16(b01, prod_lo);  // unsigned b bytes first, signed a bytes second
    prod_hi = _mm_maddubs_epi16(b01, prod_hi);
    sum_lo = _mm_add_epi16(sum_lo, prod_lo);
    sum_hi = _mm_add_epi16(sum_hi, prod_hi);

but the shuffles can only execute on port 5, whereas two multiply-adds can start every cycle. The two variants are probably quite close in performance.
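Putting the pieces together, here is a minimal sketch of the unblocked variant of this kernel (illustrative and untested: it processes 32-row strips, broadcasts each {b[j], b[j+1]} pair with _mm_set1_epi16 for brevity instead of the unpack/pshufd sequence above, and relies on the question's guarantee that neither the 16-bit accumulation nor the saturating pmaddubsw can overflow):

    #include <tmmintrin.h>  // SSSE3 for _mm_maddubs_epi16
    #include <cstdint>

    // M, N are compile-time constants as in the question (M % 32 == N % 32 == 0).
    void mat_vec_prod_sse(const int8_t** a, const uint8_t* b, int16_t* c) {
        for (int i = 0; i < M; i += 32) {
            __m128i s0 = _mm_setzero_si128();  // accumulates c[i+ 0 .. i+ 7]
            __m128i s1 = _mm_setzero_si128();  // accumulates c[i+ 8 .. i+15]
            __m128i s2 = _mm_setzero_si128();  // accumulates c[i+16 .. i+23]
            __m128i s3 = _mm_setzero_si128();  // accumulates c[i+24 .. i+31]
            for (int j = 0; j < N; j += 2) {
                // byte pair {b[j], b[j+1]} in every 16-bit lane
                __m128i bj = _mm_set1_epi16(int16_t(b[j] | (b[j + 1] << 8)));
                __m128i a0 = _mm_load_si128((const __m128i*)(a[j] + i));
                __m128i a1 = _mm_load_si128((const __m128i*)(a[j + 1] + i));
                __m128i a2 = _mm_load_si128((const __m128i*)(a[j] + i + 16));
                __m128i a3 = _mm_load_si128((const __m128i*)(a[j + 1] + i + 16));
                // interleave the two rows, then uint8 (b) x int8 (a) multiply-add:
                // word k gets b[j]*a[j][i+k] + b[j+1]*a[j+1][i+k]
                s0 = _mm_add_epi16(s0, _mm_maddubs_epi16(bj, _mm_unpacklo_epi8(a0, a1)));
                s1 = _mm_add_epi16(s1, _mm_maddubs_epi16(bj, _mm_unpackhi_epi8(a0, a1)));
                s2 = _mm_add_epi16(s2, _mm_maddubs_epi16(bj, _mm_unpacklo_epi8(a2, a3)));
                s3 = _mm_add_epi16(s3, _mm_maddubs_epi16(bj, _mm_unpackhi_epi8(a2, a3)));
            }
            _mm_store_si128((__m128i*)(c + i), s0);
            _mm_store_si128((__m128i*)(c + i + 8), s1);
            _mm_store_si128((__m128i*)(c + i + 16), s2);
            _mm_store_si128((__m128i*)(c + i + 24), s3);
        }
    }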

Aki Suihkonen
  • On Sandy Bridge/Ivy Bridge (the only Intel CPUs with AVX1 but not AVX2), unpacking/shuffling (as well as `paddw`) seems to happen on either `p1` or `p5`, but `pmaddubs` is limited to `p0`. But in any case, unpacking, `pmaddubs` and `paddw` should distribute nicely over the available ports also on newer CPUs (if AVX2 is available, throughput could of course be doubled, with some additional overhead of unshuffling upper and lower halves). – chtz Sep 30 '21 at 13:49