
This is a 4x4 matrix multiplication program using AVX2, but the program is not displaying any output. Please see where the problem is. Do I have to do anything about memory alignment or not? Please suggest.

#include <iostream>
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>

using namespace std;

void m4x4_AVX(float *A, float *B, float *C) {
    __m256 row1 = _mm256_load_ps(&B[0]);
    __m256 row2 = _mm256_load_ps(&B[4]);
    __m256 row3 = _mm256_load_ps(&B[8]);
    __m256 row4 = _mm256_load_ps(&B[12]);
    for(int i=0; i<4; i++) {
        __m256 brod1 = _mm256_set1_ps(A[4*i + 0]);
        __m256 brod2 = _mm256_set1_ps(A[4*i + 1]);
        __m256 brod3 = _mm256_set1_ps(A[4*i + 2]);
        __m256 brod4 = _mm256_set1_ps(A[4*i + 3]);
        __m256 row = _mm256_add_ps(
                    _mm256_add_ps(
                        _mm256_mul_ps(brod1, row1),
                        _mm256_mul_ps(brod2, row2)),
                    _mm256_add_ps(
                        _mm256_mul_ps(brod3, row3),
                        _mm256_mul_ps(brod4, row4)));
        _mm256_store_ps(&C[4*i], row);
        printf("%f\t", C[4*i]);
    }
}

int main(){
    float a[4][4] = {{1,2,3,4},
                     {5,6,7,8},
                     {9,10,11,12},
                     {9,10,11,12}
                    };

    float b[4][4] = {{5,2,3,4},
                     {6,6,7,8},
                     {7,10,11,12},
                     {8,10,11,12}
                    };

    float c[4][4];

    m4x4_AVX((float*)&a, (float*)&b, (float*)&c);

    return 0;
}
– K.Malu
  • If you don't want to mess with alignment, you should use `_mm256_loadu_ps` and `_mm256_storeu_ps`; note the 'u' after load and store, which means unaligned. – Jonas Nov 20 '16 at 11:13
  • Why does this use AVX? The rows only have 4 floats in them, you know. – harold Nov 20 '16 at 11:51
  • Yeah but it should work. Right? – K.Malu Nov 20 '16 at 12:19
  • Of course in principle, doing too much math isn't a correctness problem. But it also changes the alignment requirements of aligned store/load to 32, so even if the matrix is sufficiently aligned, the second row won't be. – harold Nov 20 '16 at 12:32
  • So would you kindly tell me what your suggestions are in this case, and what do I do to run this code? – K.Malu Nov 20 '16 at 12:38
  • Best suggestion: use SSE intrinsics instead of AVX (`s/__m256/__m128/g` and `s/_mm256/_mm/g`). Second best: use unaligned loads for the misaligned data. – Paul R Nov 20 '16 at 15:53 [see the SSE sketch after these comments]
  • BTW, it looks like you just took the code from [this answer](http://stackoverflow.com/a/18508113/253056) (without any attribution) and then blindly changed all the SSE types and intrinsics to AVX, without really understanding how it works. I hope you can understand that this is pointless both as far as writing efficient code is concerned, and also as a learning exercise - did you think that this would make the code twice as fast or something? – Paul R Nov 20 '16 at 16:16
  • @PaulR: haha, well spotted. Same indenting and everything. I guess it's probably not worth all the shuffling to broadcast two adjacent floats into the lower/upper lanes of a YMM, and then the extra VEXTRACTF128 + 128-bit add. Intel's [AVX for small matrices](https://software.intel.com/en-us/articles/benefits-of-intel-avx-for-small-matrices) only uses 8x8 multiply as an example. Hmm, I think you could do the broadcast-load with VBROADCASTSD, then an in-lane shuffle to broadcast element 0 in the low lane and element 1 in the high lane (VPERMILPS with a shuffle-constant vector). – Peter Cordes Nov 21 '16 at 06:55
  • @PaulR: Or load two rows of the matrix that needs transposing, and use AVX2 VPERMPS with 4 different constants to get the shuffling done. Still not sure that's cheap enough to be worth it, since AVX VBROADCASTSS from memory doesn't require the shuffle port at all on Intel CPUs. – Peter Cordes Nov 21 '16 at 06:58 [a hedged 256-bit sketch also follows these comments]
  • @PeterCordes: I've had very mixed results with AVX to date - the two main issues seem to be (a) split lane hassles, and the extra instructions needed to work around these, and (b) the subset of AVX instructions which seem to have half the throughput of their SSE equivalents. There are some simple operations where you can get 2x over SSE, a general case where you get maybe 1.3x to 1.7x, and then the worst case, where you see no benefit at all. I suspect the above 4x4 matrix multiplication may fall into the last of these three. – Paul R Nov 21 '16 at 07:05
  • @PaulR: Yeah, I was really glad to see that AVX512 mostly drops the "in-lane" behaviour that makes AVX1 so horrible for anything that needs shuffling. Even AVX2 seems not bad with VPERMD / VPERMQ, but I haven't timed much stuff for real. But most instructions are supposed to have full throughput in 128b mode, on Intel CPUs. It can take tens of thousands of cycles for the CPU to decide to power up the upper halves of the execution units, though, so there is a warm-up period. Did you have any specific examples of instructions where you see lower throughput on say Sandybridge or Haswell? – Peter Cordes Nov 21 '16 at 07:41
  • Ah - you've got me now - it's been a few months since I last did some AVX work and noticed the problem with throughput on some instructions, and I can't remember which ones caused the problem. It's even possible that I have conflated this with AVX-512 on KNL, since I've been doing some work in that area too. I'll do a bit more digging when I've had coffee and woken up a bit... ;-) – Paul R Nov 21 '16 at 08:53
  • @PaulR: BTW, your 1.3x to 1.7x sounds very reasonable for problems that aren't pure vertical, but still worth doing some extra shuffling to use 256b ops. (Or for vertical ops where a memory bottleneck becomes a bigger factor). But it's certainly fun when AVX2 gives perfectly linear speedups, like this ["fastest way to generate 1GB of random ASCII digits, 100 per line"](http://unix.stackexchange.com/questions/323845/whats-the-fastest-way-to-generate-a-1-gb-file-containing-only-random-numbers/324520#324520) question on U&L.SE :) – Peter Cordes Nov 21 '16 at 09:04
  • @PaulR: You weren't thinking of DIVPS, were you? That's about the only instruction that Agner Fog lists as having less throughput for 256b vectors on Haswell, and it's about a factor of two worse on HSW. There are some where the YMM version becomes lane-crossing, making it 3c latency instead of 1 (e.g. PMOVZX, CVTPS2PD), but with the same throughput. Latency can affect loop throughput, but you were talking about the same kind of measurement as Agner Fog's tables, right? Repeating just one instruction without other stuff stealing cycles from it, or other vagaries of OOO execution. – Peter Cordes Nov 21 '16 at 09:11
  • Could be, but I'm afraid that even after a caffeine jolt my memory is not too good - I just remember converting something from SSE to AVX only to find that it was exactly the same speed as before, and then looking up the latency/throughput and seeing that the AVX equivalent was indeed 2x slower. However this could have been IB and AVX floating point stuff, rather than Haswell, or it could have been some AVX-512 stuff I did a month or so back on KNL - I've been scanning my commit logs for clues but haven't come up with anything... – Paul R Nov 21 '16 at 09:37
  • @PaulR: I haven't looked at anything for KNL. Even first-gen SnB had full-width execution units for everything it supports, except DIV/SQRT where throughputs are half (like they still are on SKL). Oh, I just noticed even VRSQRTPS ymm has half throughput on pre-SKL! Anyway, the other major possibility is IvB/SnB only have half-throughput 256b loads/stores, including VMASKMOVPS. I didn't notice anything else when flipping through Agner's table for IvB. Don't worry about it unless you're like me and can't stop until you satisfy your curiosity once you get started on something :P – Peter Cordes Nov 21 '16 at 11:05
  • Heh - yes, this is going to be bugging me for a while now. I get to work on a lot of different projects and it can be hard keeping track of all the details. It will probably come back to me at some random time in the future (and by that time I'll have trouble finding this comment thread! ;-)). – Paul R Nov 21 '16 at 11:11
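Following Paul R's suggestion in the comments, here is a minimal SSE sketch of the fix: every type and intrinsic is switched from AVX back to SSE, so each load and store moves exactly one 4-float row, and the `u` (unaligned) variants remove the alignment requirement altogether. The result-printing loop in `main` is an illustrative addition, since the original only printed one element per row:

```cpp
#include <immintrin.h>
#include <stdio.h>

// One __m128 holds exactly one 4-float row, so no load or store reads or
// writes past the end of a 4x4 matrix, and the unaligned loadu/storeu
// variants work for any float array (no 16- or 32-byte alignment needed).
void m4x4_SSE(const float *A, const float *B, float *C) {
    __m128 row1 = _mm_loadu_ps(&B[0]);
    __m128 row2 = _mm_loadu_ps(&B[4]);
    __m128 row3 = _mm_loadu_ps(&B[8]);
    __m128 row4 = _mm_loadu_ps(&B[12]);
    for (int i = 0; i < 4; i++) {
        __m128 brod1 = _mm_set1_ps(A[4*i + 0]);
        __m128 brod2 = _mm_set1_ps(A[4*i + 1]);
        __m128 brod3 = _mm_set1_ps(A[4*i + 2]);
        __m128 brod4 = _mm_set1_ps(A[4*i + 3]);
        __m128 row = _mm_add_ps(
                    _mm_add_ps(
                        _mm_mul_ps(brod1, row1),
                        _mm_mul_ps(brod2, row2)),
                    _mm_add_ps(
                        _mm_mul_ps(brod3, row3),
                        _mm_mul_ps(brod4, row4)));
        _mm_storeu_ps(&C[4*i], row);
    }
}

int main() {
    float a[4][4] = {{1,2,3,4}, {5,6,7,8}, {9,10,11,12}, {9,10,11,12}};
    float b[4][4] = {{5,2,3,4}, {6,6,7,8}, {7,10,11,12}, {8,10,11,12}};
    float c[4][4];

    m4x4_SSE(&a[0][0], &b[0][0], &c[0][0]);

    for (int i = 0; i < 4; i++) {      // print the whole result matrix
        for (int j = 0; j < 4; j++)
            printf("%f\t", c[i][j]);
        printf("\n");
    }
    return 0;
}
```

If aligned loads are preferred, declare the arrays with `alignas(16)` and switch back to `_mm_load_ps`/`_mm_store_ps`.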
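For completeness, here is a hedged sketch of a genuinely 256-bit variant, in the spirit of the broadcast ideas Peter Cordes floats in the comments. Instead of the exact VBROADCASTSD + VPERMILPS or VPERMPS sequences he describes, this version uses `_mm256_broadcast_ps` (VBROADCASTF128) to duplicate a row of B into both lanes and an in-lane `_mm256_shuffle_ps` to broadcast matching elements from two rows of A, producing two rows of C per iteration. The function name `m4x4_AVX_2rows` is made up for this example, only AVX1 is actually required, and whether the extra shuffling beats the plain SSE version is exactly the open question from this thread:

```cpp
#include <immintrin.h>

// Computes C = A * B for 4x4 row-major float matrices, two rows of C at a
// time: one __m256 holds [row i of C | row i+1 of C].
void m4x4_AVX_2rows(const float *A, const float *B, float *C) {
    // Duplicate each 4-float row of B into both 128-bit lanes (VBROADCASTF128).
    __m256 b0 = _mm256_broadcast_ps((const __m128 *)&B[0]);
    __m256 b1 = _mm256_broadcast_ps((const __m128 *)&B[4]);
    __m256 b2 = _mm256_broadcast_ps((const __m128 *)&B[8]);
    __m256 b3 = _mm256_broadcast_ps((const __m128 *)&B[12]);
    for (int i = 0; i < 4; i += 2) {
        __m256 a = _mm256_loadu_ps(&A[4*i]);   // [A row i | A row i+1]
        // _mm256_shuffle_ps with an all-equal control broadcasts element j of
        // A's row i within the low lane and of row i+1 within the high lane.
        __m256 r =           _mm256_mul_ps(_mm256_shuffle_ps(a, a, 0x00), b0);
        r = _mm256_add_ps(r, _mm256_mul_ps(_mm256_shuffle_ps(a, a, 0x55), b1));
        r = _mm256_add_ps(r, _mm256_mul_ps(_mm256_shuffle_ps(a, a, 0xAA), b2));
        r = _mm256_add_ps(r, _mm256_mul_ps(_mm256_shuffle_ps(a, a, 0xFF), b3));
        _mm256_storeu_ps(&C[4*i], r);          // write rows i and i+1 together
    }
}
```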

0 Answers