
I was given the following task:

Given two arrays of 16 elements each, NIZA RESW 16 and NIZB RESW 16,
store in a third array (NIZC RESW 16) the values NIZC[i]=NIZA[i]+NIZB[i], using MMX instructions and assembling with NASM.

This is what I have so far:

    %include "apicall.inc"
    %include "print.inc"

    segment .data
    unos1 db "Array A: ", 0
    unos2 db "Array B: ", 0
    ispisC db "Array C : ", 0

    segment .bss
    NIZA RESW 16
    NIZB RESW 16
    NIZC RESW 16

    segment .text
    global start

    start:
    call init_console
    mov esi,0
    mov ecx, 16

    mov eax, unos1
    call print_string
    call print_nl

    unos_a:
    call read_int
    mov [NIZA+esi], eax
    add esi, 2
    loop unos_a

    mov esi,0
    mov ecx, 16
    mov eax, unos2
    call print_string
    call print_nl

    unos_b:
    call read_int
    mov [NIZB+esi], eax
    add esi, 2
    loop unos_b

    movq mm0, qword [NIZA]
    movq mm1, qword [NIZB]
    paddq mm0, mm1
    movq qword [NIZC], mm0


    mov esi,NIZC
    mov ecx,16
    mov eax, ispisC
    call print_string
    call print_nl

    ispis_c:
    mov ax, [esi]
    movsx eax, ax
    call print_int
    call print_nl
    add esi, 2
    loop ispis_c

    APICALL ExitProcess, 0

After assembling the program and testing it with two input arrays, the third array only contains 4 of the 16 sums (shown in the following picture).

Screenshot

Does anybody know why it only stores 4 elements out of 16? Any help is appreciated.
In case you are wondering about the helper functions: print_string, print_int, and print_nl print a string, an integer, and a newline, respectively, taking their argument in the EAX register. Also note that this is a 32-bit program.

Sep Roland
  • MMX, huh? The late 90's called, they have some Pets.com stock to sell you. SSE and its successors are pretty much universally available these days. – Nate Eldredge Nov 24 '21 at 18:00
  • I only see one store to `NIZC`, and it doesn't seem to be inside a loop, so it seems natural that you are only going to store 4 elements (64 bits). – Nate Eldredge Nov 24 '21 at 18:04
  • Also `mov [NIZA+esi], eax` seems to be doing a 32-bit store of what should be a 16-bit value. And I see calls to `print_string`, `print_nl`, etc inside your loops; are you sure they preserve ecx, esi and any other relevant registers? Finally, [the `loop` instruction is a habit you don't want to develop](https://stackoverflow.com/questions/35742570/why-is-the-loop-instruction-slow-couldnt-intel-have-implemented-it-efficiently). – Nate Eldredge Nov 24 '21 at 18:06
  • Single-stepping with your debugger would probably be very helpful in understanding what your code does and doesn't do. Do you know how to do that? If not, now is the time to learn. – Nate Eldredge Nov 24 '21 at 18:07
  • `print_string` and `print_nl` do preserve the registers; you can see the code here: https://pastebin.com/ZtTn6RU7. I tried summing the two arrays using only mov and incrementing, and it worked perfectly. – Tarik Pašić Nov 24 '21 at 18:21
  • The other thing is that I am not that experienced with MMX. How would you loop and store the results in the array `NIZC`? – Tarik Pašić Nov 24 '21 at 18:21
  • Much like you do in your `unos_b` loop. Keep a register with an offset, say `edx`, starting at 0; load from `[NIZA+edx]` and `[NIZB+edx]`, store to `[NIZC+edx]`, add 8 to `edx`, and loop back and do it again, 4 times. Or use your loop counter like `ecx` and use a scaled addressing mode: load from `[NIZA+ecx*8]` etc, and iterate ecx from 0 to 3. – Nate Eldredge Nov 24 '21 at 18:25
  • That solved my problem, thank you very much! – Tarik Pašić Nov 24 '21 at 18:38
  • `paddq` adds 64-bit elements. (One per MMX register, two per XMM register). If you actually have arrays of words, use `paddw`. Also, you can use a memory source operand, MMX and AVX don't require alignment for memory source operands, like `paddw mm0, [edx]` / `add edx, 8`. – Peter Cordes Nov 24 '21 at 23:34

1 Answer


Does anybody know why it only stores 4 elements out of 16?

Because your MMX instructions operate only on the first 4 array elements: a single movq transfers 64 bits, which is four 16-bit words. You need a loop to process all 16 array elements.

Your task description doesn't say so, but I see that you sign-extend the values from NIZC before printing, so you seem to expect signed results. I also see that you use PADDQ to operate on 4 word-sized inputs. This will not always give correct results! E.g., if NIZA[0]=-1 and NIZB[0]=5, then you will get NIZC[0]=4, but a carry will have propagated from the first word into the second word, leaving NIZC[1] wrong. This does not happen if you use the right version of the packed addition: PADDW, which adds each of the 4 words independently.
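To make the carry concrete, here is a minimal sketch with hypothetical test data (the labels A and B and their values are made up for illustration and are not part of the original program). It feeds the same inputs to both instructions; the word values in the comments are listed from the lowest word to the highest:

```nasm
segment .data
A   dw  -1, 0, 0, 0          ; hypothetical data: words FFFFh, 0, 0, 0
B   dw   5, 0, 0, 0

segment .text
    movq  mm0, qword [A]
    movq  mm1, mm0           ; second copy of the same 4 words
    paddq mm0, qword [B]     ; one 64-bit add:   words 4, 1, 0, 0 (carry leaks into word 1)
    paddw mm1, qword [B]     ; four 16-bit adds: words 4, 0, 0, 0 (carry is dropped per word)
    emms
```

The 64-bit sum is 0FFFFh + 5 = 10004h, so the extra 1 lands in what should have been NIZC[1].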

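You got lucky with the size errors on mov [NIZA+esi], eax and mov [NIZB+esi], eax. Each of these stores writes 4 bytes where only 2 belong, so it also overwrites the next word in memory. Because NIZA and NIZB follow each other in memory in the same order in which you fill them, every clobbered word gets its proper value afterwards, and no harm was done. If NIZB had been placed before NIZA, then assigning NIZB[15] would have corrupted NIZA[0].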
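The overrun is easiest to see for the last element, where esi = 30. A short sketch, assuming (as in the question) that NIZB directly follows NIZA in .bss:

```nasm
    ; esi = 30, i.e. element 15, the last word of the 32-byte array NIZA
    mov   [NIZA + esi], eax  ; 32-bit store: writes NIZA+30 .. NIZA+33,
                             ;   so bytes 32 and 33 land in NIZB[0]
    mov   [NIZA + esi], ax   ; 16-bit store: writes NIZA+30 .. NIZA+31 only
```

Only the second form stays inside the array, which is why the rewrite further down stores AX rather than EAX.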

Below is a partial rewrite where I used an input subroutine in order to not have to repeat myself.

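    mov   eax, unos1         ; Prompt for array A
    mov   ebx, NIZA          ; Destination array
    call  MyInput
    mov   eax, unos2         ; Prompt for array B
    mov   ebx, NIZB
    call  MyInput

    xor   esi, esi           ; Byte offset into the arrays
more:
    movq  mm0, qword [NIZA + esi]  ; Load 4 words from NIZA
    paddw mm0, qword [NIZB + esi]  ; Add 4 words from NIZB, word by word
    movq  qword [NIZC + esi], mm0  ; Store the 4 sums in NIZC
    add   esi, 8             ; Advance by 4 words
    cmp   esi, 32            ; 16 words = 32 bytes
    jb    more
    emms                     ; (*)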

    ...


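; IN (eax = address of prompt, ebx = address of array) OUT () MOD (eax, esi)
; Relies on read_int preserving EBX and ESI
MyInput:
    call  print_string
    call  print_nl
    xor   esi, esi           ; Byte offset into the array
  .more:
    call  read_int           ; -> EAX
    mov   [ebx + esi], ax    ; Store the low word only
    add   esi, 2
    cmp   esi, 32            ; Repeat 16 times
    jb    .more
    ret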

(*) For info about emms (Empty MMX State) see https://www.felixcloutier.com/x86/emms

Tip: You can combine mov ax, [esi] and movsx eax, ax into a single instruction: movsx eax, word [esi].
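Applied to the output loop from the question, the tip could look like the sketch below. It also replaces loop with cmp/jb, as the comments recommend, and it assumes that print_int and print_nl preserve ESI (the question does not show print_int, so treat that as an assumption):

```nasm
    mov   esi, NIZC
ispis_c:
    movsx eax, word [esi]    ; load and sign-extend in one step
    call  print_int
    call  print_nl
    add   esi, 2
    cmp   esi, NIZC + 32     ; end of the 16-word array
    jb    ispis_c
```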

Sep Roland
  • For such a tiny loop, you could fully unroll it, especially if you switch to using XMM registers so it's only two `paddw xmm0, [NIZB]` / ... / `paddw xmm1, [NIZB+16]` assuming you can align your arrays by 16. CPUs with MMX but not SSE2 range from Pentium-MMX through Pentium-III, and AMD up to Athlon-XP. x86-64 guarantees that SSE2 is available, and some 32-bit CPUs like Pentium-4 have it. (I know comments already pointed out that MMX is obsolete, but it's worth saying in any MMX answer in 2021!) – Peter Cordes Nov 25 '21 at 01:08
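The SSE2 variant suggested in the comment above can be sketched as follows. This is only an illustration, not part of the original answer: it assumes a CPU with SSE2, and it uses unaligned loads and stores (movdqu) so that it makes no assumption about how the arrays are aligned. With a memory source operand, `paddw xmm0, [NIZB]` would require 16-byte alignment.

```nasm
    movdqu xmm0, [NIZA]      ; words 0..7 of NIZA
    movdqu xmm1, [NIZA + 16] ; words 8..15 of NIZA
    movdqu xmm2, [NIZB]
    movdqu xmm3, [NIZB + 16]
    paddw  xmm0, xmm2        ; 8 word-sized adds at once
    paddw  xmm1, xmm3
    movdqu [NIZC], xmm0
    movdqu [NIZC + 16], xmm1 ; no emms needed: XMM registers do not alias the x87 stack
```

If the arrays can be aligned by 16, the two extra loads can be folded into the paddw instructions as the comment describes.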