Issues while manually increasing address of a pointer with inline assembly (gcc)

Question

my apologies for the rookie question. I am diving into assembly a bit more and I am trying to use some inline asm on a small library I wrote to do linear algebra. Everything works fine but I am having issues increasing the address pointed by the pointers. I need this step to be able to multiply elements of a matrix.

void multiply(int n, int m, int* ptrM1, int* ptrM2, int* result)
{
    int counter = n*n*m;
    int index = 1;
    int inter = 0;
    while (index != (counter+1))
    {
        asm 
        (
            "movl %2, %%eax\n\t"
            "mull %3\n\t"
            "movl %%eax, %0\n\t"
            "addl %0, %1\n\t"
            "addq $0x04, (%2)\n\t"      //<-----Problematic lines
            "addq $0x04, (%3)\n\t"      //<-----Problematic lines
            :"+c"(inter),"+b"(*result)
            :"r"(*ptrM1), "r"(*ptrM2)
            :"%eax"
        );
        printf("%d  ", *result);
        //++ptrM1;                      //Want to do this in assembly
        //++ptrM2;                      //Same
        if ((index % m ) == 0){++result;}
        ++index;
    }
}

The program compiles as it is, but I get a core dumped (segmentation fault) error when I try to run it. I suspect it has to do with the syntax I am using in the lines I commented above. I am also suspicious about the constraints I gave to the two variables since they both perform as input and output but are only declared as inputs. The thing is, I literally copied how gcc handled the "++pointer_address" in the disassembled code. Anybody that can help me out? It would be much appreciated.

EDIT: Excellent tips in the comments so far, but still don't fix the problem. Here is how I implemented them:

void multiply(int n, int m, int* ptrM1, int* ptrM2, int* result)
{
    int counter = n*n*m;
    int index = 1;
    int inter = 0;
    while (index != (counter+1))
    {
        asm 
        (
            "movl %4, %%eax\n\t"
            "mull %5\n\t"
            "movl %%eax, %0\n\t"
            "addl %0, %1\n\t"
            "addq $0x04, %2\n\t" //Problematic lines
            "addq $0x04, %3\n\t" //problematic lines
            :"+c"(inter),"+b"(*result), "+rm"(ptrM1), "+rm"(ptrM2)
            :"r"(*ptrM1), "r"(*ptrM2)
            :"%eax"
        );
        printf("%d  ", *result);
        //++ptrM1;                  //Want to do this in assemlby
        //++ptrM2;                  //Same
        if ((index % m ) == 0){++result;}
        ++index;
    }
}

EDIT 2: And this is the original function written in C, for comparison.

void multiply(int n, int m, int* ptrM1, int* ptrM2, int* result)

{
    int counter = n*n*m;
    int index = 1;
    int inter = 0;
    while (index != (counter+1))
    {
        if ((index % m ) == 0)
        {
            inter = *ptrM1 * *ptrM2;
            *result += inter;
            ++ptrM1;
            ++ptrM2;
            ++result;
        }
        else
        {
            inter = *ptrM1 * *ptrM2;
            *result += inter;
            ++ptrM1;
            ++ptrM2;
        }
        ++index;
    }
}

You probably need an early clobber. I strongly recommend not using inline assembly for this sort of thing. There's nothing in there that needs it and if you want to practice assembly programming, write the whole code in assembly in an assembly file. If you have to use inline assembly for some reason, try to split it up into statements that are as short as possible, ideally one instruction per statement. You should be able to get rid of all data moves, too. — fuz, Jul 25 '22 at 19:00
Thanks for the comment, I will try an early clobber. You are right, asm is not really needed, I am practicing some asm "reverse engineering" my C code. I still plan to rewrite the whole library in assembly. Cheers. — Fulvio, Jul 25 '22 at 19:05
You use `*ptrM1` which dereferences, giving you an `int` that you then try to use as a pointer. That won't work. If you want to do `++ptrM1` you do not need to dereference, not in the operand an not in the asm. What you will need is an output of course. — Jester, Jul 25 '22 at 19:05
To increment the pointer you want a `"+rm" (ptrM1)` in the output section and an `addq $4, %foo` (duplicate for the other one). — Jester, Jul 25 '22 at 19:16
I tried to implement this changes, but I still get segmentation fault. I am going to edit my question so that you can see the new code. — Fulvio, Jul 25 '22 at 19:28
@Fulvio I don't see any early clobbers in there. You also don't have a clobber for `edx`, so your `mull` instruction kills some random variable. — fuz, Jul 25 '22 at 19:36
As far as I can remember my linear algebra you need to move one pointer through the columns and the other through rows. You don't seem to do that. Also your use of the `inter` is confusing. — Jester, Jul 25 '22 at 19:40
@fuz you're right I tried but it's not very clear to me where to put the early clobbers (by which I assume you mean the '&' token in the constrains of the outputs). Also, why does ```edx``` need a clobber? I am not directly referring to it. From what I understood ```mull``` operates on ```eax``. Could you maybe show me how you mean? — Fulvio, Jul 25 '22 at 19:45
@Jester I am using another method. The main idea is to have the user work with an "array of arrays" of dimensions n and m. To this function you see I feed two "array of arrays" converted in strings of numbers that "map" to each other and are multiplied/accumulated m times. The goal is to have a final string of elements that can be mapped into the resulting nxn matrix. As far as the logic of the program goes it works in C and also with the inline assembly if I keep increasing the pointer's memory in C. Inter is an intermediate variable I use to store the result. Without the inline code breaks. — Fulvio, Jul 25 '22 at 19:52
Would help if you showed the original code in C. Clearly the problem is not with the `++ptrM1` anymore. — Jester, Jul 25 '22 at 19:55
@Fulvo You are using the single operand form of `mull`. This multiplies `eax` with its operand, leaving a 64 bit result in `edx:eax`. Refer to an instruction set reference for details. — fuz, Jul 25 '22 at 19:56
@Fulvio As for the early clobbers: all registers that are overwritten before all input operands are read need an early clobber. — fuz, Jul 25 '22 at 19:57
See [Looping over arrays with inline assembly](https://stackoverflow.com/q/34244185) for two ways of approaching it: use `"r"` with a pointer and using an addressing mode inside the inline asm, or using an `"m"` operand and letting the compiler pick an addressing mode. You're using `"r"(*int_ptr)` and then also using `(%2)` inside the asm, so you're dereferencing a 32-bit int. — Peter Cordes, Jul 25 '22 at 19:57
Also, instead of dealing with `mul` clobbering EDX, just use `imul %5, %%eax`, or let the compiler pick an output register operand. If you're not going to use the high-half output of `mul`, use a non-widening multiply. (https://www.felixcloutier.com/x86/imul) — Peter Cordes, Jul 25 '22 at 19:59
@fuz aha! I forgot about that even if I remember reading it somewhere. Does it overwrite ```edx``` even if the result doesn't need all 64 bits? @Jester I will share it in a moment. — Fulvio, Jul 25 '22 at 19:59
The 64-bit result of `mul` always fills all 64 bits. It wouldn't be very useful if 64-bit math left garbage in the high half for small inputs. Read the manual (https://www.felixcloutier.com/x86/mul) and/or try it yourself single-stepping in a debugger. (**A debugger is also a good way to see how the compiler filled in your asm template string**, to understand what overall asm is actually running.) — Peter Cordes, Jul 25 '22 at 20:00
@Fulvio Of course it always overwrite both `edx` and `eax`. There are very few instructions that only some times overwrite the destination registers an `mul` is not one of them. Note that if you want to have a regular 32 × 32 → 32 multiplication, you should simply use the `imul r, r/m` instruction. It also works for unsigned numbers! — fuz, Jul 25 '22 at 20:01
@fuz it worked! Funny how this particular was breaking the whole thing. But in retrospect I also learned how to properly reference pointers and memory in asm. Thanks! — Fulvio, Jul 25 '22 at 20:05
@fulvio - Like the earlier comments say, you are not learning assembly here, you are learning *gcc inline assembly*, which has a myriad of odd rules and conventions - **totally** different from the rules you need to follow when writing assembly in a separate asm file. Instead of learning the processor instructions, you now know some secret incantations for the compiler, that a separate assembler doesn't use. So, you are trying out some chinese as preparation for writing a book in greek? :-) — BoP, Jul 25 '22 at 20:09
@Fulvio It's not very surprising. Overwriting random registers is something that generally leads to bad results. — fuz, Jul 25 '22 at 20:11

Fulvio · Answer 1 · 2022-07-25T21:31:22.270

1

So thanks to everybody who commented. The final working function is the following:

void multiply(int n, int m, int* ptrM1, int* ptrM2, int* result)
{
    int counter = n*n*m;
    int index = 1;
    int inter = 0;
    while (index != (counter+1))
    {
        asm 
        (
            "movl %4, %%eax\n\t"
            "mull %5\n\t"
            "movl %%eax, %0\n\t"
            "addl %0, %1\n\t"
            "addq $0x04, %2\n\t" 
            "addq $0x04, %3\n\t" 
            :"+c"(inter),"+b"(*result), "+rm"(ptrM1), "+rm"(ptrM2)
            :"r"(*ptrM1), "r"(*ptrM2)
            :"%eax", "%edx"
        );

        if ((index % m ) == 0){++result;}
        ++index;
    }
}

So to summarize, I was both using the pointer wrong and not clobbering a cluttered register. The solution consisted in adding the de-deferenced pointers into the outputs with "+rm" constrains, and clubbering both eax and edx since mull will store a 64 bit result in edx:eax.

edited Jul 25 '22 at 21:31

answered Jul 25 '22 at 20:12

Fulvio

31
6

1

If you used `imul`, you wouldn't need EAX and EDX clobbers. Just `mov %4, %0` / `imul %5, %0`. Also, it's weird to do the pointer increments inside the asm statement, but have the compiler do the dereferencing. You *can* do that, but it's not obvious why you'd want to nail down this part of the asm but not the whole loop. – Peter Cordes Jul 25 '22 at 20:48
@PeterCordes That's a good observation. I am working top down on this function in order to have it in pure assembly. I first "translated" the purely arithmetical part and I managed to debug it myself. Then I focused on the pointers, but couldn't get that part without your help :) now I will slowly work my way into the rest (index increase, if statement etc.). Looking forward to lose some sanity with conditional jumps and flags. The goal is to have a function to which you pass C arguments and uses 100% assembly logic. Is that so backwards?! LoL It helps me understand both C and asm.... – Fulvio Jul 25 '22 at 21:28
1

I'd start by looking at optimized compiler output for the function, with compiler options to not go too crazy. e.g. `-O2 -fno-unroll-loops -fno-tree-vectorize`. Or perhaps `-O1` or `-Og` minimal optimization. (`-O0` is cluttered mess of stores/reloads, unless you use `register int foo` for a bunch of variables.) See [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116) and put your code on https://godbolt.org/ to help match source lines to asm instructions. The compiler knows sane ways to do things, e.g. multiply integers. – Peter Cordes Jul 25 '22 at 21:32
@PeterCordes thanks for the tip and all the useful links! :) – Fulvio Jul 25 '22 at 21:33
Also, flattening your loop like that with an `if( ... % m)` instead of using nested loops is probably going to compile a lot less efficiently. You have to check the modulo every iteration, instead of just checking whether the next modulo comes before the end of the outer loop. And since you case did `counter = n*n*m`, you don't need a `min()` or something, you can just separate the loop conditions so the inner loop is only checking a counter running up to `sometime + m`, or from 0 to ` – Peter Cordes Jul 25 '22 at 21:36
Is that actually more efficient at compilation? Because from an algorithmic point of view right now the ```while``` loop increases in complexity as O(n), with a nested ```for``` it would be O(n^2). Meaning that an increase in the dimensions of the matrices will exponentially increase the time needed for the computation. If gcc is inefficient in this regard, maybe I can find a better way in asm? – Fulvio Jul 25 '22 at 22:00
1

Your current implementation has `O(n*n*m)` complexity, as it runs the loop body that many times. And the loop body is O(1), running inline asm and checking something `%m`. IDK how you're getting that down to `O(n)`. Maybe you're forgetting what `n` is? It's a different `n` if it comes from squaring some input vs. not. A straightforward matmul on a square matrix is O(N^3) work over O(N^2) data, so yes, the amount of work does increase *cubically* (not exponentially). – Peter Cordes Jul 25 '22 at 22:03
@PeterCordes thanks for pointing out my ignorance. It is not a case, in fact, that I am a hobbist programmer. Have a nice day sir. – Fulvio Jul 25 '22 at 22:06
There *are* algorithms that do less than O(N^3) work, and are potentially worth using on *large* matrices, despite their higher constant factors. For example https://en.wikipedia.org/wiki/Strassen_algorithm . Similarly for BigInt multiplication there's [Karatsuba](https://en.wikipedia.org/wiki/Karatsuba_algorithm), and other later advances, of which [only a couple are useful for BigInt problems of a practical size on real computers.](https://mattermodeling.stackexchange.com/questions/1355/did-the-2019-discovery-of-on-logn-multiplication-have-a-practical-outcome) – Peter Cordes Jul 25 '22 at 22:07
@PeterCordes, I did check out godbolt.com and the other links you shared. That's a hell of a tool, it really helped me understand how the compiler passes variables to functions and how it references them in the function body. I changed my plan, instead of working my way with inline assembly I am going to write directly a small library in ASM for linear algebra utilities. My background is in natural sciences so up to now programming was modelling and plotting in Python - if not even in Matlab. I expected both C and ASM to be waaay harder. Thanks for the patience. – Fulvio Jul 26 '22 at 19:34
Sounds like a fun hobby project as a way to learn some asm. If you actually want good performance, plain C, or C with SIMD intrinsics, will likely be easier to write, and easier to modify to play around with cache-blocking optimization. SIMD often works well for linear algebra problems, processing 16 or 32 bytes at once (e.g. 8x int32_t). – Peter Cordes Jul 26 '22 at 19:41
Some methods introduced [here](https://codegolf.stackexchange.com/q/248007/108859) might help if you plan to make this faster. An answer there fixed a subtle point I missed initially making it double speed. To make it even faster especially for larger matrices, you have to split the matrix to fit in cache blocks, but now you have to read some serious papers.. – xiver77 Jul 27 '22 at 02:12

Issues while manually increasing address of a pointer with inline assembly (gcc)

1 Answers1

Linked