1

I'm implementing 3*3 matrix and 3*1 matrix multiplication using RISC-V assembly language, with GNU C inline asm.

// description: matrix multiply with two-level for loop

#include<stdio.h>
int main()
{
        int f,i=0;
        int h[9]={0}, x[3]={0}, y[3]={0};
        FILE *input = fopen("../input/3.txt","r");
        for(i = 0; i<9; i++) fscanf(input, "%d", &h[i]);
        for(i = 0; i<3; i++) fscanf(input, "%d", &x[i]);
        for(i = 0; i<3; i++) fscanf(input, "%d", &y[i]);
        fclose(input);
        int *p_x = &x[0] ;
        int *p_h = &h[0] ;
        int *p_y = &y[0] ;

        for (i = 0 ; i < 3; i++)
        {

                p_x = &x[0] ;
        /*
         for (f = 0 ; f < 3; f++)
             *p_y += *p_h++ * *p_x++ ;
        */

                for (f = 0 ; f < 3; f++){
                   asm volatile (
                    "addi t0, zero, 2\n\t"
                    "bne t0, %[f], Multi\n\t"
                    "mul t1, %[sp_h], %[sp_x]\n\t"
                    "add %[sp_y], %[sp_y], t1\n\t"
                    "addi %[p_h], %[p_h], 4\n\t"
                    "addi %[p_x], %[p_x], 4\n\t"
                    "addi %[p_y], %[p_y], 4\n\t"
                    "beq zero, zero, Exit\n\t"
                    "Multi: \n\t"
                    "mul t1, %[sp_h], %[sp_x]\n\t"
                    "add %[sp_y], %[sp_y], t1\n\t"
                    "addi %[p_h], %[p_h], 4\n\t"
                    "addi %[p_x], %[p_x], 4\n\t"
                    "Exit: \n\t"
                    :[p_y] "+&r"(p_y),
                     [p_x] "+&r"(p_x),
                     [p_h] "+&r"(p_h),
                     [sp_y] "+&r"(*p_y)
                    :[sp_h] "r"(*p_h),
                     [sp_x] "r"(*p_x),
                     [f] "r"(f)
                    :"t0","t1"

                );
            printf("x value=%d, h value=%d, y value=%d, y address=%d\n", *p_x, *p_h, *p_y, p_y);
        }

        }

        p_y = &y[0];

        for(i = 0; i<3; i++)
                printf("%d \n", *p_y++);

        return(0) ;

}

I want to transfer this comment

for (f = 0 ; f < 3; f++) 
    *p_y += *p_h++ * *p_x++ ;
p_y++;

into asm volatile(...), but above code in asm I encounter this problem: my problem It seems that p_y address is added correct, and my multiplication is correct, but my value store into memory is wrong. It will store into memory too fast, and now my answer adds previous answer together. Above code answer is 5 28 69 Could anybody help me? I have edited the "+r" to "+&r" and add the clobbers to the code, but it doesn't work.

And by the way, i get right answer from this code:

for (f = 0 ; f < 3; f++)
                    asm volatile ("mul %[tmp], %[sp_h], %[sp_x]\n\t"
                                  "add %[sp_y], %[sp_y], %[tmp]\n\t"
                                  "addi %[p_x], %[p_x], 4\n\t"
                                  "addi %[p_h], %[p_h], 4\n\t"
                                  :[tmp] "=r"(tmp),
                                   [p_x] "+r"(p_x),
                                   [p_h] "+r"(p_h),
                                   [sp_y] "+r"(*p_y)
                                  :[sp_h] "r"(*p_h),
                                   [sp_x] "r"(*p_x)
                                );
            p_y++;

I expect

14 
32 
50

for the answer. And here is input data:

1 2 3 4 5 6 7 8 9
1 2 3
0 0 0
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Justdoit
  • 13
  • 4
  • For starters you forgot to declare clobbers on `"t0", "t1"` which you use in the asm template. So that's undefined behaviour, and might be stepping on the compiler's toes; it might have picked the same register for any of your "r"` or `"+r"` constraints; check the compiler-generated asm. Also, your `"+r"` in/out operands should probably be `"+&r"` early-clobber since you probably write some of them before the last read of a pure input. – Peter Cordes Apr 07 '23 at 01:05
  • If there's any other bug remaining after you fix that, edit with an update and it can maybe get reopened. But that undefined behaviour is a total showstopper. You can look at the compiler-generated asm locally with `gcc -S`, or on https://godbolt.org/ ([How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116)). – Peter Cordes Apr 07 '23 at 01:15
  • It still gives the same numerical output after fixing those bugs? Your `printf` of `*p_x` etc. is using pointers after the inline asm incremented them past the ends of the array or row, isn't it? Or if the branching inside the asm supposed to not increment one of the pointers in the final iteration? Why is there so much duplicated code, instead of just jumping over the one instruction you don't want to do. (Or am I missing something and the two sides of the `f!=2` branch are different). Use a debugger to single-step the asm (including the compiler-generated asm). – Peter Cordes Apr 07 '23 at 03:20
  • @PeterCordes thank you for answering my question, i edited my question and it has the same result. I think it maybe the problem you said above (it might have picked the same register for any of your "r"` or "+r" constraints), but I don't know how to fix this problem (pick the same register). Is there any solution for this? – Justdoit Apr 07 '23 at 03:31
  • `"r"` inputs definitely won't pick the same register as any `"+&r"` outputs; you correctly solved that problem, along with the `"t0","t1"` clobber list. Any remaining bugs are not from using inline asm wrong, not that I can see. – Peter Cordes Apr 07 '23 at 03:38
  • Your original program showed some context, but you've removed that. I assume that means the plain C version does give correct results in whatever surrounding code you have, with whatever types `p_x` and `p_h` have now. Before it was odd that they were both `int *` and incrementing by 1 element for a matmul where the `p_y` doesn't change inside the inner loop. Normally you need a row x column dot-product to get an output element, so one of the pointers needs to stride down a column. (Which is very inefficient for larger matrices, bad access pattern for cache.) – Peter Cordes Apr 07 '23 at 03:44
  • @PeterCordes yes, it still the same numerical output after fixing those bugs and i will try to correct my dup code. If i have any problem, i will edited my post again. Thank you! – Justdoit Apr 07 '23 at 03:47
  • Do you get the correct output if you actually use `*p_y += *p_h++ * *p_x++ ;` for testing? If not, then the problem wasn't your asm, it was your surrounding code or that loop logic. – Peter Cordes Apr 07 '23 at 03:52
  • @PeterCordes I get the expect output from using `*p_y += *p_h++ * *p_x++ ;` – Justdoit Apr 07 '23 at 04:00
  • When you single-step in a debugger, are the pointers incrementing correctly for the first 2 `f` iterations vs. the 3rd, i.e. only incrementing `p_y` for the `f == 2` iteration? – Peter Cordes Apr 07 '23 at 04:14
  • For the `[sp_y] "+&r"(*p_y)` output, does it get loaded with the old `p_y` but stored with the new `p_y`? That would be a problem. Your code isn't currently a [mcve], there isn't surrounding code I could copy/paste into a compiler (e.g. on https://godbolt.org/) to look at the compiler-generated code. I'm not sure if the `*p_y` gets re-evaluated as a C expression after the asm statement, or if it binds to one object. You could try using a temporary variable, like `int sum = 0;` (or `int sum = *p_y;`) ahead of the loop, with a `"+&r"(tmp)` in/out operand that doesn't involve a changing pointer – Peter Cordes Apr 07 '23 at 04:20
  • Oh my bad, it is once again complete, I'll have a look... Well almost complete, it doesn't compile due to a syntax error (a stray `,`). https://godbolt.org/z/76fEr1d55 – Peter Cordes Apr 07 '23 at 04:26
  • Yeah, https://godbolt.org/z/bG75WdnTs shows `"+&r"(*p_y)` uses the updated `p_y` pointer. You don't want it to load/store inside the loop anyway, so either use a temp variable or use a separate `asm` statement for the final increment; don't do that increment before (or unsequenced with) when the compiler gets the output sum. – Peter Cordes Apr 07 '23 at 04:33
  • Your update with a simplified asm statement and a separate `p_y++;` was the clue I needed to start thinking about there being something wrong with how operands were getting in/out of the more complex asm statement. So nice work debugging and narrowing down the problem. Surprisingly, this is a problem I've never seen before in 8 years of reading every new [assembly] question on Stack Overflow. – Peter Cordes Apr 07 '23 at 05:44
  • @PeterCordes Thank you for your help! Your explaination is very clear. I learn a lot. Thank you very much! – Justdoit Apr 07 '23 at 05:48

1 Answers1

0

Your [sp_y] "+&r"(*p_y) input/output uses the original p_y to read the input, but it uses the p_y updated by the asm statement (in the f==2 case) to store the result. The compiler-generated asm from gcc -O1 -Wall (on Godbolt) shows this nicely:

# top of inner loop
.L4:
        lw      a1,0(s3)
        lw      a0,0(s2)
        lw      a3,0(s1)      # "+r"(*p_y) in/out operand in a3
        mv      a4,s1         # "+r"(p_y)  in a4
        mv      a5,s2
        mv      a2,s3
# your asm starts here
        addi t0, zero, 2
        bne t0, s0, Multi
        mul t1, a1, a0
        add a3, a3, t1        # sum += stuff
        addi a2, a2, 4
        addi a5, a5, 4
        addi a4, a4, 4
        beq zero, zero, Exit
Multi: 
        mul t1, a1, a0
        add a3, a3, t1       # sum += stuff in the other branch
        addi a2, a2, 4
        addi a5, a5, 4
Exit: 

# your asm ends here, compiler now has to move regs to C objects
# (with more optimization, it might have just used s1 as [p_y] avoiding two mv)
        mv      s1,a4
        mv      s2,a5
        mv      s3,a2
        sw      a3,0(a4)    # store new sum using the new p_y
        lw      a2,0(a2)
        lw      a1,0(a5)
        addi    a0,s5,%lo(.LC3)
        call    printf

GCC's docs don't mention what happens when one output modifies a C variable that's also part of lvalue expression for another output, at least not that I noticed. I'd assume the outputs are unsequenced wrt. each other, in which case it's undefined behaviour similar to x = ++x + ++x;, or at least unspecified exactly what happens.

So don't do that. Don't conditionally increment p_y inside the inner loop, so don't make p_y an output (or input) to the inner-loop asm statement at all.

I don't think there's any way to make that fully well-defined, although "=r"(p_y[f == 2 ? -1 : 0]) and a matching constraint like "0"(*p_y) would probably always use the updated p_y from the other output register. But then you're manually doing an offset in the C to balance what the asm does, totally defeating the purpose. So use a separate asm statement outside the inner loop to do the pointer increment in the outer loop.

A separate int sum = *p_y; variable would be normal for this kind of thing, but you'd still have to assign *p_y = sum; before incrementing p_y, so you still couldn't do it inside the inner-loop asm statement. You don't want the compiler to sw / lw the sum inside the inner loop like you're encouraging it to do with *p_y.

So it would look like

  int sum = *p_y;      // asm lw
  for (int f = 0 ; f < 3; f++) {
      asm("..." 
          : "+&r"(sum), ...  // not including p_y
          : [sp_h] "r"(*p_h),  [sp_x] "r"(*p_x)  // same as before
          : // no clobbers, you don't need to compare/branch inside the asm
      );
  }
  *p_y = sum;          // asm sw
  asm("addi %0, %0, 4" : "+r"(p_y));  // p_y++;

Or write the whole outer-loop body (including the inner loop and the p_y++) in asm, including your own lw and sw, so you can manually put the addi %[p_y], %[p_y], 4 after an sw that uses %[p_y].

(Don't forget a "memory" clobber for an asm statement with pointer inputs that reads/writes memory that isn't an "m" constraint.)

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847