x86 assembly idioms

Question

I've been trying to get a good hold on the x86 assembly language, and was wondering if there was a quick-and-short equivalent of movl $1, %eax. That's when I thought that a list of idioms used frequently in the language would perhaps be a good idea.

This could include the preferred use of xorl %eax, %eax as opposed to movl $0, %eax, or testl %eax, %eax against cmpl $0, %eax.

Oh, and kindly post one example per post!

`movl $1, %eax` is pretty quick and short. On some processors, `xorl %eax, %eax` is actually slower than `movl $0, %eax`. On others, `incl %eax` is slower than `addl $1, %eax`. If you are going to the trouble of writing assembly in 2010, you should know for what architecture you are writing and select your "dialect" (to keep with the linguistic metaphor) in consequence. — Pascal Cuoq, Apr 15 '10 at 17:48
@Pascal Cuoq, could you please explain what factors affect this sort of a difference in performance? I am especially baffled by `incl %eax` being slower than `addl $1, %eax`. Additionally, if you could point me to some link which details this sort of behaviour, I will be grateful! — susmits, Apr 15 '10 at 18:11
For all x86 architectures in 2010, xor eax,eax is faster or equivalent; in any case it is shorter. Have a look at http://stackoverflow.com/questions/1396527/any-reason-to-do-a-xor-eax-eax/1396552#1396552. This has been the case pretty much since the days of the 486. — Gunther Piez, Mar 07 '11 at 19:50
Voting to close as too broad. The individual examples mentioned have already been raised in other posts. — Ciro Santilli OurBigBook.com, Aug 12 '15 at 15:39

score 13 · Answer 1 · answered Apr 16 '10 at 14:33

Here's another interesting "idiom". Hopefully everyone knows that division is a big time sink, even compared to a multiplication. Using a little math, it's possible to multiply by the inverse of a constant instead of dividing by it. This goes beyond the shr tricks. For example, to divide by 5:

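mov eax, some_number
mov ebx, 3435973837    ; multiplicative inverse of 5 modulo 2^32
mul ebx

Now eax has been divided by 5 without using the slow div opcode, provided some_number is an exact multiple of 5 (the constant satisfies 5 * 3435973837 = 1 modulo 2^32, so the low half of the product is the quotient only when nothing is left over). For an arbitrary unsigned value, use the high half of the product instead: after the mul, shr edx, 2 leaves the quotient in edx. Here is a list of useful constants for division, shamelessly stolen from http://blogs.msdn.com/devdev/archive/2005/12/12/502980.aspx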

divisor  inverse (mod 2^32)
3        2863311531
5        3435973837
7        3067833783
9        954437177
11       3123612579
13       3303820997
15       4008636143
17       4042322161

For numbers not on the list, you might need to do a shift beforehand (to divide by 6, shr 1, then multiply by the inverse of 3).
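
The arithmetic behind the divide-by-5 constant can be checked in C. This is only a sketch: the function names div5_exact and div5 are made up for illustration, and the code models what the mul leaves in eax (low half) and edx (high half) rather than reproducing the assembly.

```c
#include <stdint.h>

/* Exact division: valid only when n is a multiple of 5.
   5 * 0xCCCCCCCD == 1 (mod 2^32), so the low 32 bits of the product
   (what mul leaves in eax) are the quotient. */
uint32_t div5_exact(uint32_t n) {
    return (uint32_t)(n * UINT32_C(0xCCCCCCCD));
}

/* General unsigned division by 5: form the full 64-bit product and keep
   bits 34 and up. In assembly terms: mul ebx, then shr edx, 2. */
uint32_t div5(uint32_t n) {
    return (uint32_t)(((uint64_t)n * UINT32_C(0xCCCCCCCD)) >> 34);
}
```

The second form works for every 32-bit input because 0xCCCCCCCD is also 2^34 / 5 rounded up; that coincidence does not hold for every constant in the list.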

score 7 · Answer 2 · answered Apr 15 '10 at 17:57

On x64, use

xor eax, eax

instead of

xor rax, rax

(the first one also implicitly clears the upper half of rax, and it has a shorter encoding because it needs no REX prefix)

PhiS

score 7 · Answer 3 · answered Apr 15 '10 at 18:01

Using LEA for e.g. multiplication, like:

lea eax, [ecx+ecx*4]

for EAX = 5 * ECX
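
As a rough illustration, the address expression that LEA evaluates is just a base plus an index scaled by 1, 2, 4 or 8, so multipliers such as 3, 5 and 9 fall out directly. The C sketch below mirrors that arithmetic; the function names are hypothetical.

```c
#include <stdint.h>

/* Mirrors lea eax, [ecx+ecx*4]: base + index * 4. */
uint32_t times5(uint32_t x) {
    return x + (x << 2);
}

/* Mirrors lea eax, [ecx+ecx*8]: base + index * 8. */
uint32_t times9(uint32_t x) {
    return x + (x << 3);
}
```

Unlike mul or imul, LEA leaves the flags untouched and can write its result to a register other than its source.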

PhiS

BTW: this is dog slow on NetBurst, because Intel removed the barrel-shifter in order to be able to obtain higher clock speeds. Ironically, at the time the P4 came out, this was still documented in Intel's optimization manuals. – Jörg W Mittag Apr 15 '10 at 18:39
Thanks for the comment re. speed. I realise that an idiom is not necessarily the same thing as an optimisation. However, as an idiom, I think LEA has been fairly widely (ab)used. – PhiS Apr 15 '10 at 18:52
Well, it *is* an optimization. And it is even officially recommended by Intel. It's just that, after officially recommending it for 15 years, they suddenly released a new CPU on which it was slow, thus essentially requiring recompiling *every single program ever written*. Thankfully, NetBurst died a quick and painful death and all current microarchitectures are evolutions of the Pentium III, not the Pentium 4, so all current CPUs again have a barrel shifter. Basically, *all* Intel CPUs since the 80386 and all Athlons have it; only the Pentium 4 doesn't. – Jörg W Mittag Apr 16 '10 at 01:55

Sparafusile · Answer 4 · 2010-04-21T13:46:51.143

5

You might as well ask how to optimize in assembly. Then you'd have to ask what you're optimizing for: size or speed? Anyway, here's my "idiom", a replacement for xchg:

xor eax, ebx
xor ebx, eax
xor eax, ebx
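
A small C model of the same three XORs may help with the question of equal operands. It is a sketch only, and xor_swap is a hypothetical name. Two different locations holding equal values swap correctly; the value is lost only when both operands are the same location, which in C means aliased pointers.

```c
#include <stdint.h>

/* XOR swap of two 32-bit values held in memory. */
void xor_swap(uint32_t *a, uint32_t *b) {
    if (a == b) {
        return;     /* same location: the first XOR would zero it */
    }
    *a ^= *b;       /* xor eax, ebx */
    *b ^= *a;       /* xor ebx, eax */
    *a ^= *b;       /* xor eax, ebx */
}
```

With two distinct registers the guard is unnecessary, since eax and ebx can never alias each other.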

edited Apr 21 '10 at 13:46

answered Apr 15 '10 at 18:02

**WARNING:** If eax == ebx - Both will be zeroed! – LiraNuna Apr 15 '10 at 18:42
Are you sure about that? 42 ^ 42 = 0 ; 42 ^ 0 = 42 ; 0 ^ 42 = 42 – Sparafusile Apr 15 '10 at 18:53

Pascal Cuoq · Answer 5 · 2010-04-15T18:38:21.313

Expanding on my comment:

To an undiscerning processor such as the Pentium Pro, xorl %eax, %eax appears to have a dependency on %eax and thus must wait for the value of that register to be available. Later processors actually have additional logic to recognize that instruction as not having any dependencies.

The instructions incl and decl set some of the flags but leave others unchanged (the carry flag, specifically). That's the worst situation if the flags are modeled as a single register for the purpose of instruction reordering: any instruction that reads a flag after an incl or decl must be considered as depending on the incl or decl (in case it's reading one of the flags that this instruction sets) and also on the previous instruction that set the flags (in case it's reading one of the flags that this instruction does not set). A solution would be to divide the flags register into two and to consider dependencies with this finer grain. AMD, for its part, dropped the one-byte encodings of these instructions (opcodes 40h through 4Fh, which became REX prefixes) from the 64-bit extension it proposed a few years back; the longer ModRM-encoded forms are still available.

Regarding the links, I found this either in the Intel manuals for which it's useless to provide a link because they are on a corporate website that's reorganized every six months, or on Agner Fog's site: http://www.agner.org/optimize/#manuals

score 5 · Answer 6 · answered Apr 16 '10 at 13:39

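In loops,

  dec     ecx
  cmp     ecx, -1
  jnz     Loop

can be replaced by

  dec     ecx
  jns     Loop

which is faster and shorter. The two forms agree as long as ecx starts out non-negative as a signed value: dec sets the sign flag, so jns falls through exactly when ecx reaches -1.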
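
In C terms, the dec/jns pair corresponds to a count-down loop whose exit test is the sign of the decremented counter. The sketch below assumes a non-negative starting count; count_iterations is a hypothetical helper that only counts how often the body runs.

```c
#include <stdint.h>

/* Runs the body for i = n, n-1, ..., 0, that is n + 1 times.
   Assumes n >= 0. */
uint32_t count_iterations(int32_t n) {
    uint32_t iterations = 0;
    int32_t i = n;
    do {
        iterations++;       /* loop body */
        i--;                /* dec ecx */
    } while (i >= 0);       /* jns Loop: repeat while the result is non-negative */
    return iterations;
}
```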

GJ.

Isn't loop .Loop easier? – Hasan Saad Nov 19 '13 at 15:16
@Hasan Saad: It is, but it is slower; using loop on x86 is deprecated. – GJ. Nov 19 '13 at 19:15
Thanks a lot :) I had no idea about that so thanks for the information. Highly appreciated :) – Hasan Saad Nov 20 '13 at 14:48

score 3 · Answer 7 · answered Apr 15 '10 at 18:03

Using SHL and SHR for multiplication/division by a power of 2

PhiS

It can be extended to other numbers as well. E.g., `y*320 = (y << 8) + (y << 6)`. That may not always be faster than a simple multiplication, though. Depends on your processor. – csl Jun 23 '16 at 09:13
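
A short C sketch of the shift idioms, with hypothetical function names. It sticks to unsigned values, where shr gives the same result as division. For signed values the arithmetic shift sar rounds toward negative infinity, whereas idiv truncates toward zero, so the two differ for negative operands.

```c
#include <stdint.h>

/* shl eax, 3: multiply by 2^3. */
uint32_t times8(uint32_t x) {
    return x << 3;
}

/* shr eax, 3: unsigned divide by 2^3. */
uint32_t div8(uint32_t x) {
    return x >> 3;
}

/* 320 = 256 + 64, so y * 320 = (y << 8) + (y << 6). */
uint32_t times320(uint32_t y) {
    return (y << 8) + (y << 6);
}
```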

PhiS · Answer 8 · 2010-04-15T18:13:19.853

2

Another one (besides xor) for

mov eax, 0   ; B800000000h

is

sub eax, eax ; 29C0h

Rationale: smaller opcode

edited Apr 15 '10 at 18:13

answered Apr 15 '10 at 18:07

score 2 · Answer 9 · answered Apr 15 '10 at 18:29

Don't know whether this counts as an idiom, but on most processors prior to i7

movq xmm0, [eax]
movhps xmm0, [eax+8]

or, if SSE3 is available,

lddqu xmm0, [eax]

are faster for reading from an unaligned memory location than

movdqu xmm0, [eax]

PhiS

score 1 · Answer 10 · answered Mar 07 '11 at 17:56

The earliest reference to division by invariant integers that is more than just an inverse multiply is here: Torbjörn Granlund of the Royal Institute of Technology in Stockholm. Check out his publications.

Olof Forshell
