I am trying to write a function that will fill my float matrix with zeros using ymm registers.

Fairly quickly, I wrote this function:

void fillMatrixByZeros(float matrix[N][N]){
    for (int k = 0; k < N; k += 8){
        for (int i = 0; i < N; ++i){
            asm volatile (
                "vxorps %%ymm0, %%ymm0, %%ymm0;"
                "vmovups %%ymm0, (%0)"
                : "=m"(matrix[i] + k)
                : 
                : "%ymm0", "memory"
            );
        }
    }
}

I tried to compile my whole code and I got this error:

prog.cpp: In function ‘void fillMatrixByZeros(float (*)[16])’:
prog.cpp:35:8: error: lvalue required in asm statement
   35 |       );
      |        ^
prog.cpp:35:8: error: invalid lvalue in asm output 0

I concluded that matrix[i] + k is an rvalue or something similar, so it can't be used there.

After googling, I came up with two solutions:

First:

void fillMatrixByZeros(float matrix[N][N]){
    for (int k = 0; k < N; k += 8){
        for (int i = 0; i < N; ++i){
            asm volatile (
                "vxorps %%ymm0, %%ymm0, %%ymm0;"
                "vmovups %%ymm0, (%0)"
                : 
                : "r"(matrix[i] + k)
                : "%ymm0", "memory"
            );
        }
    }
}

Second:

void fillMatrixByZeros(float matrix[N][N]){
    long long int matrixPointer;
    for (int k = 0; k < N; k += 8){
        for (int i = 0; i < N; ++i){
            asm volatile (
                "vxorps %%ymm0, %%ymm0, %%ymm0;"
                "vmovups %%ymm0, (%0)"
                : "=r"(matrixPointer)
                : "0"(matrix[i] + k)
                : "%ymm0", "memory"
            );
        }
    }
}

Both of these functions work correctly, and I want to know why.

Why are there no lvalue problems in the first function? And what is going on in the second one?

Peter Cordes
lyect

1 Answer

You cannot assign to matrix[i] + k, so it is not an lvalue. The m constraint expects an object in memory, not its address. So to fix this, supply the object you want to assign to instead of its address:

void fillMatrixByZeros(float matrix[N][N]){
    for (int k = 0; k < N; k += 8){
        for (int i = 0; i < N; ++i){
            asm volatile (
                "vxorps %%ymm0, %%ymm0, %%ymm0;"
                "vmovups %%ymm0, %0"
                : "=m"(matrix[i][k])
                : 
                : "%ymm0", "memory"
            );
        }
    }
}

This is the correct way to access objects in memory in an inline assembly statement.

The solutions using an r constraint with the address for the operand and then doing an explicit dereference work, too. But they are likely less efficient because they prevent the compiler from using some other addressing mode, like a SIB addressing mode. Instead it has to first materialise the address in a register.

Your last example is a bit silly. It uses coupled asm operands to essentially perform matrixPointer = matrix[i] + k before passing that to the inline assembly statement. This is a pretty roundabout way to do it and not at all needed.
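To see what the coupled (matching) operands do in isolation, here is a minimal sketch with an empty asm template. The helper name tiedCopy is hypothetical and not part of the question's code. The "0" constraint asks the compiler to place the input in the same register as operand 0, so the output simply reads back whatever the input was:

```c
#include <assert.h>

/* Hypothetical helper, for illustration only. */
static float *tiedCopy(float *in) {
    float *out;
    __asm__ (""            /* empty template: no instruction modifies the register */
             : "=r"(out)   /* operand 0: output in some general-purpose register */
             : "0"(in));   /* input must be placed in the same register as operand 0 */
    assert(out == in);     /* the net effect is just out = in */
    return out;
}
```

In the question's second function the template additionally stores through that register, but the operand setup amounts to the same assignment.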

That said, for further efficiency you should hoist the clearing of ymm0 out of the loop. Something like this perhaps?

#include <immintrin.h>

#define N 1000

void fillMatrixByZeros(float matrix[N][N]){
    for (int k = 0; k < N; k += 8){
        for (int i = 0; i < N; ++i){
            asm volatile (
                "vmovups %1, %0"
                : "=m"(matrix[i][k])
                : "x"(_mm256_setzero_ps())
                : "memory"
            );
        }
    }
}

Note that just calling memset is likely to perform a lot better than hand-rolled inline assembly.
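A minimal sketch of the memset version, assuming IEEE 754 floats, where the all-zero bit pattern represents 0.0f. N is set to 16 here only to match the error message in the question, and assertMatrixIsZero is a hypothetical debug helper:

```c
#include <assert.h>
#include <string.h>

#define N 16  /* assumption: matches float (*)[16] from the error message */

void fillMatrixByZeros(float matrix[N][N]) {
    /* The parameter has decayed to float (*)[N], so sizeof matrix would be
       the size of a pointer. Spell out the size of the whole array instead. */
    memset(matrix, 0, sizeof(float[N][N]));
}

/* Hypothetical debug helper: aborts if any element is not 0.0f. */
void assertMatrixIsZero(float matrix[N][N]) {
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < N; ++k)
            assert(matrix[i][k] == 0.0f);
}
```

Compilers typically recognise memset and pick a suitable vectorised implementation for the target on their own.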

fuz
  • You can remove the memory clobber, since the memory operand is specified correctly. – prl Nov 18 '21 at 21:45
  • @prl No, because more than a single array member is written. – fuz Nov 18 '21 at 21:47
  • Right, I missed the k += 8. – prl Nov 18 '21 at 21:48
  • This could avoid a memory clobber with `"=m"( *(__m256 *)(&matrix[i][k]) )` to tell the compiler that `matrix[i][k + 0..7]` is the output memory operand, @prl. Or cast to a pointer-to-array-of-8-float with `"=m" ( *(float (*)[8])(&matrix[i][k]) )` as shown in [declare memory \*pointed\* to by an inline ASM argument used?](//stackoverflow.com/q/56432259). Although at that point you might as well just fully use intrinsics because this is really silly. Especially using `asm volatile`, unless this is for "security" reasons to make sure memory is zeroed even if those stores are dead. – Peter Cordes Nov 19 '21 at 03:44
  • Oh wow, holy crap I just realized that this is incrementing `k` in the *outer* loop, so not clearing contiguous memory, instead striding down columns. https://godbolt.org/z/xsrrPojT6 Yes, definitely use memset, but also learn how C arrays work and why sequential access matters. [What Every Programmer Should Know About Memory?](https://stackoverflow.com/q/8126311) – Peter Cordes Nov 19 '21 at 03:46
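Putting the two comments above together, a sketch might look like the following. It assumes N is a multiple of 8 and that the file is built with AVX enabled (for example -mavx); without AVX the sketch falls back to memset so that it still compiles. The array-of-8-float cast is the idiom from the linked question. N = 16 is an assumption taken from the error message.

```c
#include <assert.h>
#include <string.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

#define N 16  /* assumption: matches float (*)[16] from the error message */
static_assert(N % 8 == 0, "each store covers 8 floats, so N must be a multiple of 8");

void fillMatrixByZeros(float matrix[N][N]) {
#ifdef __AVX__
    const __m256 zero = _mm256_setzero_ps();   /* zeroed once, outside the loops */
    for (int i = 0; i < N; ++i) {              /* rows in the outer loop */
        for (int k = 0; k < N; k += 8) {       /* walk along each row contiguously */
            __asm__ ("vmovups %1, %0"
                     /* output operand is the whole 8-float chunk, so neither a
                        "memory" clobber nor volatile is needed */
                     : "=m"(*(float (*)[8])&matrix[i][k])
                     : "x"(zero));
        }
    }
#else
    /* Fallback when AVX is not enabled at compile time. */
    memset(matrix, 0, sizeof(float[N][N]));
#endif
}
```

With the rows in the outer loop, consecutive stores hit consecutive addresses instead of jumping N floats each time.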