
I recently wrote a program in C++ and ASM. Can anyone help me make this code more efficient, in the ASM part or in both? I would really appreciate it, because I don't know every ASM instruction and am probably using far too many. BTW, the program sums two integer vectors of any size. The code I have is below:

C++:


#include <cstdio>
#include <cstdlib>

extern "C" {
    int add_vtr_asm(int*, int*, int*, int);
}

void add_vtr() {
    __declspec(align(16)) int vetor1[1024];
    __declspec(align(16)) int vetor2[1024];
    __declspec(align(16)) int soma[1024];

    for (int i = 0; i <= 1023; i++) {
        vetor1[i] = i;
        vetor2[i] = i;
    }
    add_vtr_asm(vetor1, vetor2, soma, 1024);
    for (int i = 0; i <= 1023; i++) {
        printf("%d + %d = %d\n", vetor1[i], vetor2[i], soma[i]);
    }
    exit(0);
}

int main()
{

    printf("Programa para somar vetores de inteiros: \n");
    printf("Soma de vetores com % d elementos \n", 1024);
    add_vtr();
}

ASM:
 

.MODEL FLAT, C

.CODE
add_vtr_asm PROC
    push ebp
    mov ebp, esp
    push esi
    push edi
    mov esi, [ebp+8]    ; first source vector
    mov ebx, [ebp+12]   ; second source vector
    mov edi, [ebp+16]   ; destination vector
    mov ecx, [ebp+20]   ; element count
    shr ecx, 2          ; 4 ints per 128-bit iteration
next:
    movdqa xmm0, [esi]
    add esi, 16
    paddd xmm0, [ebx]
    add ebx, 16
    movdqa [edi], xmm0
    add edi, 16
    dec ecx
    jnz next
    pop edi
    pop esi
    pop ebp
    ret
add_vtr_asm ENDP
END

  • I would benchmark both this and a pure C++ version (with full optimisation in both cases, of course). You may well find that you are wasting your time resorting to assembly code. – Paul Sanders May 12 '22 at 21:13
  • If this code works and you're looking for input on style and efficiency you should probably ask on [codereview.se], not here. – Tangentially Perpendicular May 12 '22 at 21:14
  • Yeah I know but my project is to compare this one with a pure C++ project. – Luis Barros May 12 '22 at 21:14
  • OK. Have you actually benchmarked it yet? If you're seeking to improve on something you already have, you must first have a baseline to work from. – Paul Sanders May 12 '22 at 21:15
  • You're looking for `#include ` so the compiler can inline `paddd` for you. Also, no need for non-portable syntax; `alignas(16) int vetor1[1024];` works just fine. Or let the compiler auto-vectorize for you; with an alignment guarantee and knowledge that the count is a multiple of 4 ints, it should make a loop at least as good with full optimization enabled. ([How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116)) – Peter Cordes May 12 '22 at 21:29
  • Also, your asm violates the calling convention: you destroy the caller's EBX. Looks like EAX and EDX are unused in your code, so you only need to save/restore one register. – Peter Cordes May 12 '22 at 21:30
  • If your target platform has a 32-bit integer, then the `alignas(16)` won't help. Let the compiler do its job. Also, refrain from making large local arrays; if you want them local, prefix with `static`. – Thomas Matthews May 12 '22 at 21:31
  • @ThomasMatthews: `alignas(16)` aligns by 16 **bytes** (or `char`s in ISO C++), exactly as much alignment as the OP's hand-written asm requires (SSE2 [`movdqa`](https://www.felixcloutier.com/x86/movdqa:vmovdqa32:vmovdqa64) and a memory source operand for `paddd`). Also, three arrays of 4x 1024 bytes is medium-sized at most, not truly *large*. It's a bit excessive for a non-leaf function that might be part of a deep call-tree with multiple functions using non-tiny local buffers, but not actually a problem in a toy program to run on Windows (1MiB stacks) or Linux (8MiB stacks by default). – Peter Cordes May 12 '22 at 22:14

1 Answer


I just coded it up in plain C++ and get this: https://godbolt.org/z/P1zPWv65b

struct Vec {
    alignas(16) int data[1024];
};

Vec add(const Vec &v1, const Vec &v2) {
    Vec v;
    for (int i = 0; i < 1024; ++i) {
        v.data[i] = v1.data[i] + v2.data[i];
    }
    return v;
}

Compiling with: g++ -std=c++20 -O2 -W -Wall

add(Vec const&, Vec const&):
        xor     eax, eax
.L2:
        movdqa  xmm0, XMMWORD PTR [rsi+rax]
        paddd   xmm0, XMMWORD PTR [rdx+rax]
        add     rax, 16
        movaps  XMMWORD PTR [rax-16+rdi], xmm0
        cmp     rax, 4096
        jne     .L2
        mov     rax, rdi
        ret

I don't see how I can improve on that. And that's without asking for vectorization explicitly; plain optimization already picks SIMD just fine.

Note: without alignas(16) or std::array<int, 1024> the compiler uses movdqu which I assume might be slower. But I didn't test that. It's likely the whole code is limited by the memory bandwidth.

Adding -funroll-loops makes the loop do 128 bytes at a time using xmm0-xmm7 but you have to measure yourself if that is better.

The C/C++ code is much more readable and will work on any CPU. A vector addition is so easy for the compiler to optimize that writing it in asm can only make it worse.

Goswin von Brederlow
  • *I don't see how I can improve that.* - I do, a small amount :P. The indexed addressing modes are a problem for front-end throughput on Sandybridge and Ivybridge ([Micro fusion and addressing modes](https://stackoverflow.com/q/26046634)). And for the port 7 store AGU on Haswell and later (before Ice Lake), which is necessary for max throughput of 2 loads + 1 store per clock. Unrolling 2 or 4 times to amortize a pointer increment would let you avoid indexed addressing modes for at least the store, optionally also the loads. (Haswell doesn't unlaminate legacy-SSE instructions.) – Peter Cordes May 13 '22 at 18:08
  • I mean, overall agreed and +1. Looking at and comparing to compiler output is definitely a good start, especially `-O3 -march=skylake` or whatever. But both GCC and clang unfortunately unroll without switching to pointer increments, even with `-march=sandybridge` where that costs them most of the front-end benefit of unrolling on that microarchitecture, same for Haswell/Skylake using AVX2 instructions which unlaminate there. More discussion about optimal asm might make more sense if the question had been posted on https://codereview.stackexchange.com/, though. – Peter Cordes May 13 '22 at 18:17
  • One fun trick to save `add` instructions is to index one array relative to another. Like `ptrdiff = a-b;` / increment `int *pb` pointing into b / `tmpa = *(pb + ptrdiff)`. (I'm not recommending doing it in C, just using it as pseudocode). So with 2 `add` instructions to increment pointers for the store and the `paddd` memory source operand, the `movdqa` load can still use an indexed addressing mode with no downside, like `movdqa xmm0, [v1-v2 + rsi]` where v1-v2 is in some other register. – Peter Cordes May 13 '22 at 18:22
  • That either requires the compiler to know the offset between the 2 arrays, e.g. they are declared in the same stackframe, and the difference is small enough to fit in the immediate of the opcode. Or you have to use an addressing mode with 2 registers `[#0 + eax*4 + rsi]`. My guess would be that the cost of incrementing the pointer to the second vector is 0 since the CPU will do that in parallel with the body of the loop. – Goswin von Brederlow May 13 '22 at 18:43
  • I meant that as pseudocode, to calculate the difference in a register (e.g. `sub rdi, rsi`) and actually use something like `movdqa xmm0, [rdi + rsi]` / `add rsi, 16`. The cost is never zero; any instruction will take up front-end bandwidth and space in the ROB so out-of-order exec can't see as far ahead, to do stuff like see loads into later pages to start resolving TLB misses, or to resolve the mispredict of the loop branch and start issuing code from whatever's after the loop. (For data that fits in cache, the loop trip-count isn't huge). And compete more/less with the other hyper-thread. – Peter Cordes May 13 '22 at 18:59
  • BTW, x86-64 addressing modes need all the registers to be the same size, and `#0` is ARM / AArch64 syntax. :P And the element sizes and strides are the same for all 3 arrays, so you wouldn't want to scale one register by 4 vs. another, unlike in https://gcc.gnu.org/bugzilla//show_bug.cgi?id=82369 where you actually do want `[rsi + rdi*2]`. That also shows full asm for a working example. Anyway yes, with sufficient unrolling to avoid a front-end bottleneck, throughput gain might only be from getting TLB miss and prefetch going in the next virtual page sooner via using less RS/ROB size. – Peter Cordes May 13 '22 at 19:05
  • And that is why you really don't want any ASM code you don't need. There are so many different syntaxes for different CPUs and then on top you have GNU and Intel syntax making things even more confusing. – Goswin von Brederlow May 13 '22 at 19:48
  • The point of messing around by hand in assembly is to figure out what's fast, so we can then teach compilers to do it. They're not AIs, most of the tricks they use had to come from humans. But yeah you don't actually want to write loops like this by hand as part of production code, except in a really rare case or for a hobby project. If you know assembly language, a couple more syntaxes are annoying but not a big deal if you deal with them frequently. – Peter Cordes May 13 '22 at 19:58
  • @PeterCordes Now wouldn't that actually be a good idea? Write a neural net or AI to learn how to optimize asm on its own. Wouldn't that be fun? – Goswin von Brederlow May 13 '22 at 20:02
  • Yeah, possibly interesting, especially as a final sanity-check on small loops to spot wasted instructions or alternate strategies. I'd certainly worry about compile-times being really high, though; compilers currently try to avoid algorithms with exponential complexity, which limits their ability to fully explore the possibilities but keeps compile-times hopefully reasonable even for corner cases of large functions. – Peter Cordes May 13 '22 at 20:06
  • I was thinking more like pattern recognition: If you see this intermediate construct output this asm code. Where the asm code would be something learned once. The intermediate constructs could be something an AI could pick too by comparing all the code on github to find repeating patterns or so. So two separate AI projects. :) – Goswin von Brederlow May 13 '22 at 20:09