I have recently been learning about SIMD in assembly (x86_64), and had some unexpected results. What it comes down to is the following.
I have two programs that run through a loop a number of times. The first program contains a loop that executes 4 SIMD instructions; the second contains the exact same loop with one extra instruction. The code looks like this:
The first program:
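section .bss
    doublestorage: resb 8           ; 8 bytes: room for two packed 32-bit integers

section .text
    global _start

_start:
    mov rax, 0x0000000100000001     ; two 32-bit integers, both equal to 1
    mov [doublestorage], rax
    ; convert the two integers to packed doubles: each register holds {1.0, 1.0}
    cvtpi2pd xmm1, [doublestorage]
    cvtpi2pd xmm2, [doublestorage]
    cvtpi2pd xmm3, [doublestorage]
    cvtpi2pd xmm4, [doublestorage]
    cvtpi2pd xmm5, [doublestorage]
    cvtpi2pd xmm6, [doublestorage]
    cvtpi2pd xmm7, [doublestorage]
    mov rax, (1 << 31)              ; loop counter: 2^31 iterations

loop:
    movupd xmm1, xmm3               ; xmm1 = xmm3
    movupd xmm2, xmm5               ; xmm2 = xmm5
    divpd xmm1, xmm2                ; xmm1 = xmm1 / xmm2
    addpd xmm4, xmm1                ; xmm4 = xmm4 + xmm1
    dec rax
    jnz loop

    mov rax, 60                     ; sys_exit
    mov rdi, 0                      ; exit status 0
    syscall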
The second program:
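section .bss
    doublestorage: resb 8           ; 8 bytes: room for two packed 32-bit integers

section .text
    global _start

_start:
    mov rax, 0x0000000100000001     ; two 32-bit integers, both equal to 1
    mov [doublestorage], rax
    ; convert the two integers to packed doubles: each register holds {1.0, 1.0}
    cvtpi2pd xmm1, [doublestorage]
    cvtpi2pd xmm2, [doublestorage]
    cvtpi2pd xmm3, [doublestorage]
    cvtpi2pd xmm4, [doublestorage]
    cvtpi2pd xmm5, [doublestorage]
    cvtpi2pd xmm6, [doublestorage]
    cvtpi2pd xmm7, [doublestorage]
    mov rax, (1 << 31)              ; loop counter: 2^31 iterations

loop:
    movupd xmm1, xmm3               ; xmm1 = xmm3
    movupd xmm2, xmm5               ; xmm2 = xmm5
    divpd xmm1, xmm2                ; xmm1 = xmm1 / xmm2
    addpd xmm4, xmm1                ; xmm4 = xmm4 + xmm1
    movupd xmm6, xmm7               ; the one extra instruction: xmm6 = xmm7
    dec rax
    jnz loop

    mov rax, 60                     ; sys_exit
    mov rdi, 0                      ; exit status 0
    syscall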
Now, my line of thought was the following: the second program has more instructions to execute, so it should take longer to run. When I time both programs, though, the second program takes less time to complete than the first. I ran each program 100 times, and the results are:
Runtime first program: mean: 5.6129 s, standard deviation: 0.0156 s
Runtime second program: mean: 5.5056 s, standard deviation: 0.0147 s
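A measurement of this kind can be reproduced with a small harness. The following is a minimal Python sketch under stated assumptions, not necessarily how the numbers above were obtained: the helper name `time_runs` is hypothetical, runtime is wall-clock time taken with `time.perf_counter` (so it includes process start-up), and the spread is the sample standard deviation.

```python
import statistics
import subprocess
import sys
import time


def time_runs(cmd, runs):
    """Run `cmd` `runs` times; return (mean, sample std dev) in seconds.

    `time_runs` is a hypothetical helper for illustration. `runs` must be
    at least 2, because the sample standard deviation needs two points.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)  # raises if the exit status is non-zero
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; replace it with the
    # real binaries, e.g. ["exec/instr4"] and ["exec/instr5"], and runs=100.
    mean, stdev = time_runs([sys.executable, "-c", "pass"], runs=5)
    print(f"mean: {mean:.4f} s, standard deviation: {stdev:.4f} s")
```

Timing the whole process this way also counts the start-up code before the loop, which is identical in both programs.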
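I conclude that the second program is consistently faster: about 0.107 s (1.9%) on average, which is roughly seven times the standard deviation of either measurement. These results seem counterintuitive to me, so I was wondering what could be the reason for this behavior.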
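To put the gap in perspective, the mean runtimes can be converted to time per loop iteration. This is only arithmetic on the figures reported above, sketched in Python; the variable names are illustrative, and no clock frequency is assumed, so the result is in nanoseconds rather than cycles.

```python
ITERATIONS = 1 << 31  # loop count loaded into rax in both programs

# Mean runtimes in seconds, as reported above.
mean_first = 5.6129   # loop with 4 SIMD instructions
mean_second = 5.5056  # loop with 5 SIMD instructions

ns_first = mean_first / ITERATIONS * 1e9
ns_second = mean_second / ITERATIONS * 1e9

print(f"first program:  {ns_first:.3f} ns per iteration")
print(f"second program: {ns_second:.3f} ns per iteration")
print(f"difference:     {ns_first - ns_second:.3f} ns per iteration")
```

That comes to roughly 2.61 ns versus 2.56 ns per iteration, so the extra instruction coincides with each iteration being about 0.05 ns shorter.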
To be complete, I am running Ubuntu 15.10 with the NASM assembler (-f elf64) on an Intel Core i7-5600. I also checked the disassembly, and the assembler emitted the loop exactly as written:
Objdump of the first program:
exec/instr4: file format elf64-x86-64
Disassembly of section .text:
00000000004000b0 <.text>:
4000b0: 48 b8 01 00 00 00 01 movabs $0x100000001,%rax
4000b7: 00 00 00
4000ba: 48 89 04 25 28 01 60 mov %rax,0x600128
4000c1: 00
4000c2: 66 0f 2a 0c 25 28 01 cvtpi2pd 0x600128,%xmm1
4000c9: 60 00
4000cb: 66 0f 2a 14 25 28 01 cvtpi2pd 0x600128,%xmm2
4000d2: 60 00
4000d4: 66 0f 2a 1c 25 28 01 cvtpi2pd 0x600128,%xmm3
4000db: 60 00
4000dd: 66 0f 2a 24 25 28 01 cvtpi2pd 0x600128,%xmm4
4000e4: 60 00
4000e6: 66 0f 2a 2c 25 28 01 cvtpi2pd 0x600128,%xmm5
4000ed: 60 00
4000ef: 66 0f 2a 34 25 28 01 cvtpi2pd 0x600128,%xmm6
4000f6: 60 00
4000f8: 66 0f 2a 3c 25 28 01 cvtpi2pd 0x600128,%xmm7
4000ff: 60 00
400101: b8 00 00 00 80 mov $0x80000000,%eax
400106: 66 0f 10 cb movupd %xmm3,%xmm1
40010a: 66 0f 10 d5 movupd %xmm5,%xmm2
40010e: 66 0f 5e ca divpd %xmm2,%xmm1
400112: 66 0f 58 e1 addpd %xmm1,%xmm4
400116: 48 ff c8 dec %rax
400119: 75 eb jne 0x400106
40011b: b8 3c 00 00 00 mov $0x3c,%eax
400120: bf 00 00 00 00 mov $0x0,%edi
400125: 0f 05 syscall
Objdump of the second program:
exec/instr5: file format elf64-x86-64
Disassembly of section .text:
00000000004000b0 <.text>:
4000b0: 48 b8 01 00 00 00 01 movabs $0x100000001,%rax
4000b7: 00 00 00
4000ba: 48 89 04 25 2c 01 60 mov %rax,0x60012c
4000c1: 00
4000c2: 66 0f 2a 0c 25 2c 01 cvtpi2pd 0x60012c,%xmm1
4000c9: 60 00
4000cb: 66 0f 2a 14 25 2c 01 cvtpi2pd 0x60012c,%xmm2
4000d2: 60 00
4000d4: 66 0f 2a 1c 25 2c 01 cvtpi2pd 0x60012c,%xmm3
4000db: 60 00
4000dd: 66 0f 2a 24 25 2c 01 cvtpi2pd 0x60012c,%xmm4
4000e4: 60 00
4000e6: 66 0f 2a 2c 25 2c 01 cvtpi2pd 0x60012c,%xmm5
4000ed: 60 00
4000ef: 66 0f 2a 34 25 2c 01 cvtpi2pd 0x60012c,%xmm6
4000f6: 60 00
4000f8: 66 0f 2a 3c 25 2c 01 cvtpi2pd 0x60012c,%xmm7
4000ff: 60 00
400101: b8 00 00 00 80 mov $0x80000000,%eax
400106: 66 0f 10 cb movupd %xmm3,%xmm1
40010a: 66 0f 10 d5 movupd %xmm5,%xmm2
40010e: 66 0f 5e ca divpd %xmm2,%xmm1
400112: 66 0f 58 e1 addpd %xmm1,%xmm4
400116: 66 0f 10 f7 movupd %xmm7,%xmm6
40011a: 48 ff c8 dec %rax
40011d: 75 e7 jne 0x400106
40011f: b8 3c 00 00 00 mov $0x3c,%eax
400124: bf 00 00 00 00 mov $0x0,%edi
400129: 0f 05 syscall