
I am developing high-performance algorithms based on the Intel instruction sets (AVX, FMA, ...). My algorithms (my kernels) work pretty well when the data is stored sequentially. However, I am now facing a big problem and I have not found a workaround or solution for it. Consider a 2D matrix:

int x, y; x = y = 4096;
int i, j;
float data[x*y] __attribute__((aligned(32)));    /* row-major matrix; note: 64 MB, in practice heap-allocated */
float buffer[y] __attribute__((aligned(32)));    /* holds one extracted column */

/* simple test data */
for (i = 0; i < x; i++)
    for (j = 0; j < y; j++)
        data[y*i+j] = y*i+j; // 0,1,2,3...4095 | 4096,4097, ... 8191 | ...

/* 1) Extract a column out of the matrix (column 0 here) */
__m256i vindex; __m256 vec;
vindex = _mm256_set_epi32(7*y, 6*y, 5*y, 4*y, 3*y, 2*y, y, 0);

for (i = 0; i < x; i += 8)
{
    /* gathers data[i*y], data[(i+1)*y], ..., data[(i+7)*y] */
    vec = _mm256_i32gather_ps(&data[i*y], vindex, 4);
          _mm256_store_ps(&buffer[i], vec);
}

/* 2) Perform functions */
fft(buffer, x);   /* my FFT kernel (not shown) */

/* 3) write back buffer into matrix */
/* strided write??? ... */
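
For completeness, a minimal sketch of what step 3 could look like without a scatter instruction (AVX2 has no float scatter), using plain scalar strided stores; this is only an illustration and is not claimed to be fast:

/* hypothetical write-back sketch: scalar stores with the same stride */
for (i = 0; i < x; i++)
    data[i*y] = buffer[i];   /* column 0; column c would be data[i*y + c] */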

I want to find a very efficient way to do the following:

  1. Extract the columns out of the matrix: col1 = 0, 4096, 8192, ...; col2 = 1, 4097, 8193, ... I tried this with gather_ps, which is really slow.

  2. Perform my highly efficient algorithms on the extracted columns...

  3. Store the columns back into the matrix.

Is there any special trick for that? How can you read and write with a stride (e.g. 4096) using the Intel instruction sets?

Or is there any memory-manipulation option to get the columns out of the matrix?

Thank you!

Maron
  • matmul is one of the most heavily-used computational operations in scientific computing, and has been studied extensively. Usually you should use an existing optimized implementation (like Eigen or an optimized BLAS library) instead of trying to write your own, unless you want to do other work on the fly at the same time to increase computational intensity. Cache-blocking is key for matrices that don't fit in L1d cache. [How does BLAS get such extreme performance?](https://stackoverflow.com/q/1303182) and [What Every Programmer Should Know About Memory?](https://stackoverflow.com/a/47714514) – Peter Cordes Jan 24 '19 at 16:44
  • Thank you Peter for your answer. Matmul is not the operation I am looking for. I have to perform operations only on the columns of the matrix such as FFT (Azimuth Compression). As far as I know, BLAS does not offer such functions. I was using FFTW in wisdom mode but I wasn't satisfied. Therefore, I decided to develop my own FFT for a special purpose. – Maron Jan 24 '19 at 16:59
  • Oh, I see, I think @nominal's matmul example got me thinking that. Can you run 4 or 8 columns in parallel? (Or better, 16 so you're accessing a full cache line of columns in each row). Then you never need to shuffle within a vector, everything is pure vertical and you have consecutive elements of each column in different vectors. – Peter Cordes Jan 24 '19 at 17:11

3 Answers

6

[For row-major data, SIMD access to a row is fast, but access to a column is slow]

Yes, that is the nature of the x86-64 and similar architectures. Accessing consecutive data in memory is fast, but accessing scattered data (whether randomly or in a regular pattern) is slow. It is a consequence of having processor caches.

There are two basic approaches: copy the data into a new order that facilitates better access patterns, or do the computations in an order that allows better access patterns.

No, there are no rules of thumb or golden tricks that make it all just work. In fact, even comparing different implementations is tricky, because there are so many complex interactions (from cache latencies to operation interleaving, to cache and memory access patterns) that the results are heavily dependent on the particular hardware and the dataset at hand.


Let's look at the typical example case, matrix-matrix multiplication. Let's say we multiply two 5×5 matrices (c = a × b), using standard C row-major data order:

c00 c01 c02 c03 c04   a00 a01 a02 a03 a04   b00 b01 b02 b03 b04
c05 c06 c07 c08 c09   a05 a06 a07 a08 a09   b05 b06 b07 b08 b09
c10 c11 c12 c13 c14 = a10 a11 a12 a13 a14 × b10 b11 b12 b13 b14
c15 c16 c17 c18 c19   a15 a16 a17 a18 a19   b15 b16 b17 b18 b19
c20 c21 c22 c23 c24   a20 a21 a22 a23 a24   b20 b21 b22 b23 b24

If we write the result as vertical SIMD vector registers with five components, we have

c00   a00   b00   a01   b05   a02   b10   a03   b15   a04   b20
c01   a00   b01   a01   b06   a02   b11   a03   b16   a04   b21
c02 = a00 × b02 + a01 × b07 + a02 × b12 + a03 × b17 + a04 × b22
c03   a00   b03   a01   b08   a02   b13   a03   b18   a04   b23
c04   a00   b04   a01   b09   a02   b14   a03   b19   a04   b24

c05   a05   b00   a06   b05   a07   b10   a08   b15   a09   b20
c06   a05   b01   a06   b06   a07   b11   a08   b16   a09   b21
c07 = a05 × b02 + a06 × b07 + a07 × b12 + a08 × b17 + a09 × b22
c08   a05   b03   a06   b08   a07   b13   a08   b18   a09   b23
c09   a05   b04   a06   b09   a07   b14   a08   b19   a09   b24

and so on. In other words, if c has the same data order as b, we can use SIMD registers with consecutive memory contents for both c and b, and only gather a. Furthermore, all components of each SIMD register for a hold the same value (a broadcast).

Note, however, that the b registers repeat for all five rows of c. So, it might be better to initialize c to zero, then do the additions with products having the same b SIMD registers:

c00    a00   b00   c05    a05   b00   c10    a10   b00   c15    a15   b00   c20    a20   b00
c01    a00   b01   c06    a05   b01   c11    a10   b01   c16    a15   b01   c21    a20   b01
c02 += a00 × b02,  c07 += a05 × b02,  c12 += a10 × b02,  c17 += a15 × b02,  c22 += a20 × b02
c03    a00   b03   c08    a05   b03   c13    a10   b03   c18    a15   b03   c23    a20   b03
c04    a00   b04   c09    a05   b04   c14    a10   b04   c19    a15   b04   c24    a20   b04

If we transposed a first, then the SIMD vector registers for a would also get their values from consecutive memory locations. In fact, if a is large enough, linearizing the memory access pattern for a as well gives a large enough speed boost that it is faster to do a transposed copy first (using uint32_t for floats and uint64_t for doubles; i.e. not using SIMD or floating point at all for the transpose, but just copying the storage in transposed order).
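
As a concrete illustration of the broadcast-a, consecutive-b/c pattern described above, here is a minimal AVX2/FMA sketch for row-major matrices. The function name and the assumption that the column count N is a multiple of 8 are mine, not part of the answer; c must be zeroed beforehand, as noted above:

#include <immintrin.h>

/* c[MxN] += a[MxK] * b[KxN], all row-major; N is assumed to be a multiple of 8 */
static void matmul_f32(float *c, const float *a, const float *b,
                       int M, int K, int N)
{
    for (int i = 0; i < M; i++)
        for (int k = 0; k < K; k++) {
            __m256 va = _mm256_set1_ps(a[i*K + k]);       /* broadcast a(i,k)     */
            for (int j = 0; j < N; j += 8) {
                __m256 vb = _mm256_loadu_ps(&b[k*N + j]); /* consecutive row of b */
                __m256 vc = _mm256_loadu_ps(&c[i*N + j]); /* consecutive row of c */
                vc = _mm256_fmadd_ps(va, vb, vc);         /* c(i,j..j+7) += a*b   */
                _mm256_storeu_ps(&c[i*N + j], vc);
            }
        }
}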

Note that the situation with column-major data order, i.e. data order transposed compared to above, is very similar. There is deep symmetry here. For example, if c and b have the same data order, and a the opposite data order, you can SIMD-vectorize the matrix product efficiently without having to copy any data. Only the summing differs, as that depends on the data order, and matrix multiplication is not commutative (a×b != b×a).

Obviously, a major wrinkle is that the SIMD vector registers have a fixed size, so instead of using a complete row as a register as in the example above, you can only use partial rows. (If the number of columns in the result is not a multiple of the SIMD register width, you have that partial vector to worry about as well.)

SSE and AVX have a relatively large number of registers (8, 16, or 32, depending on the set of extensions used), and depending on the particular processor type, might be able to perform some vector operations simultaneously, or at least with lower latency if unrelated vector operations are interleaved. So, even the choice of how wide a chunk to operate on at once, and whether that chunk is like an extended vector or more like a block submatrix, is up to discussion, testing, and comparison.

So how do I do the matrix-matrix multiplication most efficiently using SIMD?

Like I said, that depends on the data set. No easy answers, I'm afraid.

The main parameters (for choosing the most efficient approach) are the sizes and memory ordering of the multiplicand and result matrices.

It gets even more interesting, if you calculate the product of more than two matrices of different sizes. This is because the number of operations then depends on the order of the products.


Why are you so discouraging?

I'm not, actually. All of the above means that not too many people can handle this kind of complexity and stay sane and productive, so there are a lot of undiscovered approaches, and a lot to gain in real-world performance.

Even if we ignore the SIMD intrinsics compilers provide (<x86intrin.h> in this case), we can apply the logic above when designing internal data structures, so that the C compiler we use has the best opportunities for vectorizing the calculations for us. (They're not very good at it yet, though. Like I said, this is complex stuff. Some people like Fortran better than C, because its expressions and rules make it easier for Fortran compilers to optimize and vectorize them.)

If this were simple or easy, the solutions would be well known by now. They aren't, because it is not. But that does not mean it is impossible or out of our reach; it only means that smart enough developers haven't yet put enough effort into unraveling it.

Nominal Animal
  • Thank you very much for your detailed answer! I carefully read your statements. Since I have been developing high-performance algorithms based on SIMD for years, I am aware of the problems that come with non-sequential data access. For my algorithms I have the restriction that I must not copy the data but perform all the operations in place, to save memory. Transposing a matrix in advance is also not an option. Maybe one thing about my timings: if I execute my functions in the x direction (sequential data) of a 2D 4k x 4k matrix it takes 0.046 seconds. In the y direction, 1.03 seconds, which is insane. – Maron Jan 24 '19 at 15:08
  • @Maron: Carefully examine your restrictions; do not accept nontechnical ones, as this is engineering, not politics. When limited by memory, various blocking approaches can be used. (That is, you only copy a small block that you work on at a time.) As an example, if you split a matrix to 3×3 submatrices, you can work on them in order 11,12,21,31,22,13,23,32,33, with pretty good data locality, regardless of the data order. As to the timings, yep; that's what cache misses and being bottlenecked on RAM-to-cache bandwidth with an unfortunate access pattern gives you. – Nominal Animal Jan 24 '19 at 16:09
  • The problem is that when performing, for example, an FFT on the columns, it is not so easy to do it blockwise. Since I have RAM limitations I cannot just copy the matrix and transpose it. However, my idea was to extract the columns by using Intel's gather_ps. It works fine but it takes too much time. I was wondering if I could find another solution by asking here. – Maron Jan 24 '19 at 16:53
  • @Maron: FFT is just a faster way to compute a discrete Fourier transform, and DFT itself is essentially `dft[k] = sum(signal[n] * C(k,n,N), n=0..N-1)`. Nobody forces you to calculate dft[0] before dft[1]; you can do `dft[k] += signal[n] * C(k, n, N)` in any order you like, as long as you start with `dft` all zeroes, and do it for all k and all n from 0 to N-1, inclusive. It sounds to me like you need to dive in deeper to the task at hand, and look at the operations your chosen algorithms need to do. :) – Nominal Animal Jan 24 '19 at 18:16
  • Thank you for this hint. I am using the radix algorithm in order to calculate the FFT. I have implemented the algorithm in C, it works really well, and the performance is more than satisfying. Actually, my whole master's thesis was about the DFT and its possible implementations. Therefore, I really know what I want to do :) However, I will take your ideas into account for the next steps. – Maron Jan 24 '19 at 19:35
  • @Maron: Yes - also note I try to write my responses not only for you, but for all those who read this answer and exchange later on, perhaps working on a similar/related problem. Anyway, my point was that if the data order is particularly unfortunate (like in your 0.046 s vs. 1.03 s case), switching to a weaker/slower algorithm that allows better access patterns may provide a significant boost in overall performance. This is what I mean by the different facets of the computer hardware and the algorithm implementation details interacting, which can be quite counterintuitive. – Nominal Animal Jan 24 '19 at 20:24
2

If you can run your algorithms over 8 (or 16, see footnote 1) columns in parallel, one regular AVX load can grab 8 columns of data into one vector. Then another load can grab the next row from all of those columns.

This has the advantage that you never need to shuffle within a vector; everything is pure vertical and you have consecutive elements of each column in different vectors.

If this was a reduction like summing a column, you'd produce 8 results in parallel. If you're updating the columns as you go, then you're writing vectors of results for 8 columns at once, instead of a vector of 8 elements of one column.
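
A minimal sketch of this access pattern (the function name and the column-sum reduction are illustrative assumptions, not something this answer prescribes): sum 8 adjacent columns at once, so every load is a contiguous 32-byte load from one row.

#include <immintrin.h>

/* Sum each of 8 adjacent columns (starting at column c0) of a row-major
   x-by-y float matrix; out[k] receives the sum of column c0+k. */
static void sum8cols(const float *data, int x, int y, int c0, float out[8])
{
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < x; i++)                       /* walk down the rows    */
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(&data[i*y + c0]));
    _mm256_storeu_ps(out, acc);                       /* 8 column sums at once */
}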

Footnote 1: 16 float columns = 64 bytes = 1 full cache line = two AVX vectors or one AVX-512 vector. Reading / writing full cache lines at a time is a lot better than striding down one column at a time, although it is usually worse than accessing consecutive cache lines. Especially if your stride is larger than a 4k page, HW prefetching might not lock on to it very well.

Obviously make sure your data is aligned by 64 for this, with the row stride a multiple of 64 bytes, too. Pad the ends of rows if you need to.

Doing only 1 AVX vector (half a cache line) at a time would be bad if the first row will be evicted from L1d before you loop back to read the 2nd 32-byte vector of columns 8..15.


Other caveats:

4k aliasing can be a problem: A store and then a load from addresses that are a multiple of 4kiB apart aren't detected as non-overlapping right away, so the load is blocked by the store. This can massively reduce the amount of parallelism the CPU can exploit.

4k strides can also lead to conflict misses in the cache, if you're touching lots of lines that alias to the same set. So updating data in place might still have cache misses for the stores, because lines could be evicted after loading and processing, before the store is ready to commit. This is most likely to be a problem if your row stride is a large power of 2. If that ends up being a problem, maybe allocate more memory in that case and pad your rows with unused elements at the end, so the storage format never has a large power of 2 row stride.
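
A minimal sketch of that padding idea (the 16-float pad and the helper name are my choices for illustration, not from the answer):

#include <stdlib.h>

/* Pad each row by 16 floats (64 bytes) so the row stride is 4112 floats
   instead of the power-of-2 4096 that aliases badly in the cache. */
enum { ROWS = 4096, COLS = 4096, PAD = 16, STRIDE = COLS + PAD };

static float *alloc_padded(void)
{
    /* element (i, j) lives at m[i*STRIDE + j] */
    return aligned_alloc(64, sizeof(float) * ROWS * STRIDE);
}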

Adjacent-line prefetching in L2 cache (Intel CPUs) may try to fill in the pair of every line you touch, if there's spare bandwidth. This could end up evicting useful data, especially if you're close to aliasing and/or L2 capacity. But if you're not pushing into those limits, it's probably a good thing and will help when you loop over the next 16 columns.

Peter Cordes
  • Thank you Peter for the interesting ideas! Let's say we have a 4096x4096 matrix. My function needs e.g. the first element, the 1024th, the 2048th and the 3072nd of the first column for its calculation. So it is not possible to load it in the way you mentioned, as far as I understood you correctly. I will consider and analyse your caveats for the next steps. – Maron Jan 24 '19 at 19:46
0

The data is stored one row after the other in memory. Since C does not really distinguish between an array and a matrix, you can access the elements of one column with

for (int i = 0; i < rowcount; i++)
    buffer[i] = data[i*LENGTH + desired_column];

You can now store the values, or even better the addresses, and give them to your worker function. If you work through the addresses, the values change directly in the matrix, so you don't need to write them back.
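
A minimal sketch of the address-collecting variant (the names process_column and worker are illustrative, not from the answer):

/* Collect pointers to one column of a row-major matrix and hand them to a
   worker; writes through the pointers update the matrix in place. */
static void process_column(float *data, int rows, int length, int desired_column,
                           void (*worker)(float **elems, int n))
{
    float *column_ptrs[rows];                 /* VLA of element addresses */
    for (int i = 0; i < rows; i++)
        column_ptrs[i] = &data[i*length + desired_column];
    worker(column_ptrs, rows);                /* operates on the column in place */
}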

Brueni
  • Thank you Brueni. Unfortunately, this is not the solution to my problem. I want to use Intel instructions like vec = _mm256_load_ps(float *data). But the problem is that the columns are not stored sequentially in memory. Therefore, I wanted to know if there is an option for strided reads and writes like Intel's gather_ps. Or is there any other memory manipulation trick? – Maron Jan 24 '19 at 12:18
  • @Maron: This answer is suggesting that you transpose one of your inputs so columns *are* sequential in memory in that matrix. – Peter Cordes Jan 24 '19 at 16:46