9

I am writing a compiler (more for fun than anything else), but I want to try to make it as efficient as possible. For example, I was told that on the Intel architecture, using any register other than EAX for math incurs a cost (presumably because the value gets swapped into EAX to do the actual math). Here is at least one source that suggests as much (http://www.swansontec.com/sregisters.html).

I would like to verify and measure these differences in performance characteristics. Thus, I have written this program in C++:

#include "stdafx.h"
#include <intrin.h>
#include <iostream>

using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{
    __int64 startval;
    __int64 stopval;
    unsigned int value; // Keep the value so it isn't optimized out

    startval = __rdtsc(); // Get the CPU Tick Counter using assembly RDTSC opcode

    // Simple Math: a = (a << 3) + 0x0054E9
    _asm {
        mov ebx, 0x1E532 // Seed
        shl ebx, 3
        add ebx, 0x0054E9
        mov value, ebx
    }

    stopval = __rdtsc();
    __int64 val = (stopval - startval);
    cout << "Result: " << value << " -> " << val << endl;

    int i;
    cin >> i;

    return 0;
}

I tried this code, swapping eax and ebx, but I'm not getting a "stable" number. I would hope the test would be deterministic (the same number every time) because it's so short that a context switch is unlikely to occur during it. As it stands there is no statistical difference, but the number fluctuates so wildly that it would be impossible to make that determination. Even if I take a large number of samples, the numbers are still impossibly varied.

I'd also like to test xor eax, eax vs mov eax, 0, but have the same problem.

Is there any way to do these kinds of performance tests on Windows (or anywhere else)? When I used to program Z80 for my TI-Calc I had a tool where I could select some assembly and it would tell me how many clock cycles it would take to execute -- can that not be done with our new-fangled modern processors?

EDIT: There are a lot of answers indicating to run the loop a million times. To clarify, this actually makes things worse. The CPU is much more likely to context switch and the test becomes about everything but what I am testing.

Nate Zaugg
    I doubt your z80 processor had branch prediction, predictive execution, pipelined execution, or such gigantically expensive cache misses compared to cache hits. (I might be wrong on that last one.) Processors are immensely complicated these days. :) – sarnold Jun 30 '11 at 20:00
  • @sarnold: of course, for the last one the speed ratio of hits-vs-misses is immense; since the Z80 doesn't have a cache, you can assume the 0-sized cache gives instantaneous results :) – ninjalj Jun 30 '11 at 20:08
  • "I was told that on Intel architecture the use of any register other than EAX for performing math incurs a cost (presumably because it swaps into EAX to do the actual piece of math)." <-- If this was ever true, it for sure isn't now: x86es all the way back to the Pentium Pro have done comprehensive [register renaming](http://en.wikipedia.org/wiki/Register_renaming). – zwol Jun 30 '11 at 20:09
  • @Zack: indeed, also, since the 386, the ISA is much more orthogonal, so advantages due to not having to move data to EAX to perform some operations were minimized even longer ago. – ninjalj Jun 30 '11 at 20:19
  • I tried the rdtscp instruction (I'm running on an i7) and it seems to have the same effect as using the CPUID instruction, but because that instruction is variable and costly I do get better results. – Nate Zaugg Jun 30 '11 at 20:59

8 Answers

10

To even have a hope of repeatable, deterministic timing at the level that RDTSC gives, you need to take some extra steps. First, RDTSC is not a serializing instruction, so it can be executed out of order, which will usually render it meaningless in a snippet like the one above.

You normally want to use a serializing instruction, then your RDTSC, then the code in question, another serializing instruction, and the second RDTSC.

Nearly the only serializing instruction available in user mode is CPUID. That, however, adds one more minor wrinkle: CPUID is documented by Intel as requiring varying amounts of time to execute -- the first couple of executions can be slower than others.

As such, the normal timing sequence for your code would be something like this:

XOR EAX, EAX
CPUID
XOR EAX, EAX
CPUID
XOR EAX, EAX
CPUID            ; Intel says by the third execution, the timing will be stable.
RDTSC            ; read the clock
push eax         ; save the start time
push edx

    mov ebx, 0x1E532 ; seed -- execute the test sequence
    shl ebx, 3
    add ebx, 0x0054E9
    mov value, ebx

XOR EAX, EAX      ; serialize
CPUID   
rdtsc             ; get end time
pop ecx           ; get start time back
pop ebp
sub eax, ebp      ; find end-start
sbb edx, ecx

We're starting to get close, but there's one last point that's difficult to deal with using inline assembly on most compilers: there can also be some effects from crossing cache lines, so you normally want to force your code to be aligned to a 16-byte (paragraph) boundary. Any decent assembler will support that, but inline assembly in a compiler usually won't.

Having said all that, I think you're wasting your time. As you can guess, I've done a fair amount of timing at this level, and I'm quite certain what you've heard is an outright myth. In reality, all recent x86 CPUs use a set of what are called "rename registers". To make a long story short, this means the name you use for a register doesn't really matter much -- the CPU has a much larger set of registers (e.g., around 40 for Intel) that it uses for the actual operations, so whether you put a value in EBX or EAX has little effect on the register the CPU is really going to use internally. Either could be mapped to any rename register, depending primarily on which rename registers happen to be free when that instruction sequence starts.

Jerry Coffin
  • I'm within 6-8% with the changes you suggest. Thank you! – Nate Zaugg Jun 30 '11 at 20:32
  • On second glance, make that around 20% – Nate Zaugg Jun 30 '11 at 20:38
  • `lfence` instead of `cpuid` still serializes instruction execution but is faster (especially in a VM) and doesn't disturb registers. (It also doesn't drain the store buffer). Also, for short intervals you might as well ignore the high half in EDX and just take the low 32 bits of the TSC interval by subtracting `end_eax - start_eax`. – Peter Cordes Feb 19 '21 at 04:31
  • @PeterCordes: Really? That *certainly* doesn't fit with Intel's documentation. `lfence` should serialize instructions that load from memory, `sfence` instructions that store to memory, and `mfence` instructions that either load or store--but `rdtsc` doesn't touch memory, so if the docs are at all accurate, none of these affects it at all. Even if they currently do, Intel has always documented it as having no effect on register-only instructions, so even if it does currently work, they'd be free to break what you're assuming. – Jerry Coffin Feb 19 '21 at 06:06
  • It doesn't fit with the *name* chosen, but it certainly is documented by Intel these days at least. https://www.felixcloutier.com/x86/lfence. *Specifically, LFENCE does not execute until **all** prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.* Perhaps at one point just an implementation-detail of `lfence` (which is pointless for almost all other purposes, except ordering SSE4.1 `movntdqa` weakly-ordered loads from WC memory), it's now architectural. It's also well known and mentioned in various Intel white-papers on using rdtsc. – Peter Cordes Feb 19 '21 at 06:11
  • "completed locally" = "retired from the ROB". Note that AMD is potentially different, but OSes that want LFENCE for Spectre mitigation enable the control bit so LFENCE is an execution barrier and can be used for serializing RDTSC. [Is LFENCE serializing on AMD processors?](//stackoverflow.com/q/51844886). `lfence` is nearly useless for anything else; if Intel ever had plans to support a weaker memory model, they abandoned those plans leaving `lfence` basically orphaned of purpose other than this. [Does the Intel Memory Model make SFENCE and LFENCE redundant?](/stackoverflow.com/q/32705169) – Peter Cordes Feb 19 '21 at 06:16
  • Hmm...I'm reading the Intel docs, and it does look like wording is considerably stronger than the last time I'd looked (though even now, the documentation is rather inconsistent). I'll probably have to do a bit of editing to take this into account. Thanks for the heads-up. – Jerry Coffin Feb 19 '21 at 06:24
  • If you're going to edit, preemptive terminology note: `lfence` is not a "serializing instruction". That term has a specific technical meaning for x86, and includes draining the store buffer as well as the ROB. (AFAIK that's all it implies on current implementations.) Also related: [How to get the CPU cycle count in x86\_64 from C++?](https://stackoverflow.com/a/51907627) is my attempt at a canonical answer about RDTSC behaviour and related CPU feature (what Linux calls constant_tsc / nonstop_tsc and so on, some of which have CPUID feature bits). – Peter Cordes Feb 19 '21 at 06:34
  • Hmm....looking at that, it almost looks like one of these should really be closed as a dupe of the other. – Jerry Coffin Feb 19 '21 at 06:41
7

I'd suggest taking a look at Agner Fog's "Software optimization resources" - in particular, the assembly and microarchitecture manuals (2 and 3), and the test code, which includes a rather more sophisticated framework for measurements using the performance monitor counters.

Matthew Slattery
  • Awesome resource! It's interesting that he did his performance stuff in the Kernel. I thought that it might be necessary to test this stuff in the highest dispatch levels of the Kernel to get predictable results. – Nate Zaugg Jun 30 '11 at 21:04
5

The Z80, and possibly the TI, had the advantage of synchronized memory access, no caches, and in-order execution of instructions. That made it a lot easier to calculate the number of clocks per instruction.

On current x86 CPUs, instructions using AX or EAX are not faster per se, but some instructions might be shorter than the instructions using other registers. That might just save a byte in the instruction cache!
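
To make the size point concrete, here is a small illustrative snippet (mine, not from the original answer) in the question's MSVC _asm syntax. The byte counts are for the standard 32-bit encodings, where ALU-with-immediate instructions have a short "accumulator" form that exists only for EAX:

    _asm {
        add eax, 0x12345678   ; short accumulator form: 5 bytes (05 id)
        add ebx, 0x12345678   ; generic form:           6 bytes (81 C3 id)
        cmp eax, 0x12345678   ; short accumulator form: 5 bytes (3D id)
        cmp ebx, 0x12345678   ; generic form:           6 bytes (81 FB id)
    }

Both forms execute at the same speed; the only saving is the one byte of code size per instruction mentioned above.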

Bo Persson
  • And that is the conclusion I am coming to on the EAX/Math issue. But still I don't have a good way to test and measure. – Nate Zaugg Jun 30 '11 at 20:10
  • @Nate - I think it is **very** hard to measure anything at this level. If you find an instruction that has a separate form for the A/AX/EAX/RAX register, it might be a good idea to use that. Otherwise it probably doesn't matter. I haven't worked at this level since the 386, so I have obviously no current info. – Bo Persson Jun 30 '11 at 20:18
5

Go to Intel's website and download the Architectures Optimization Reference Manual.

There are many myths. I think the EAX claim is one of them.

Also note that you can't talk anymore about 'which instruction is faster'. On today's hardware there is no 1-to-1 relation between instructions and execution time. Some instructions are preferred to others not because they are 'faster' but because they break dependencies between other instructions.
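
As one illustration (my example, not the answer's): the question's own xor eax, eax vs. mov eax, 0 comparison is really about encoding size and dependency breaking rather than raw speed:

    _asm {
        mov eax, 0     ; 5 bytes (B8 imm32); an ordinary move
        xor eax, eax   ; 2 bytes; also recognized by modern CPUs as a
                       ; zeroing idiom, so it carries no dependency on
                       ; the previous value of EAX
    }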

Yakov Galka
4

I believe that if there's a difference nowadays it will only be because some of the legacy instructions have a shorter encoding for the variant that uses EAX. To test this, repeat your test case a million times or more before you compare cycle counts.

  • I've repeated it many millions of times and still get overlaps in which side wins (about 50% of the time even). The numbers are so erratic that the test just has to be invalid. – Nate Zaugg Jun 30 '11 at 20:18
4

You're getting ridiculous variance because rdtsc does not serialize execution. Depending on inaccessible details of the execution state, the instructions you're trying to benchmark may in fact be executed entirely before or after the interval between the rdtsc instructions! You will probably get better results if you insert a serializing instruction (such as cpuid) immediately after the first rdtsc and immediately before the second. See this Intel tech note (PDF) for gory details.

zwol
3

Starting your program is going to take much longer than running 4 assembly instructions once, so any difference from your assembly will drown in the noise. Running the program many times won't help, but it would probably help if you run the 4 assembly instructions inside a loop, say, a million times. That way the program start-up happens only once.
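
As a rough sketch of that idea (mine, not the answer's), the timed region of the question's program could wrap the sequence in a counted loop and divide by the iteration count afterwards:

    unsigned int value;
    __int64 start = __rdtsc();

    _asm {
        mov ecx, 1000000      ; repeat the sequence a million times
    again:
        mov ebx, 0x1E532      ; seed
        shl ebx, 3
        add ebx, 0x0054E9
        dec ecx
        jnz again
        mov value, ebx
    }

    __int64 perIteration = (__rdtsc() - start) / 1000000;
    cout << "Result: " << value << " -> ~" << perIteration << " ticks/iteration" << endl;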

There can still be variation. One especially annoying thing that I've experienced myself is that your CPU might have a feature like Intel's Turbo Boost, where it will dynamically adjust its speed based on things like the temperature of the CPU. This is more likely to be the case on a laptop. If you've got that, you will have to turn it off for any benchmark results to be reliable.

Bjarke H. Roune
  • Good comment on Turbo Boost. When I've added millions of reps in the loop it increases the chance of context switching during my test. At that point the results have more to do with everything else BUT what I am testing. – Nate Zaugg Jun 30 '11 at 20:19
  • @Nate Zaugg I don't know how the timing mechanism you are using works. If you use a timing mechanism that counts the number of cpu clock cycles you are getting on that process, then context switches shouldn't be a problem. You won't record the time where your code isn't running this way. Context switches may impact your numbers slightly, but if you run enough repetitions then both of the two options you are comparing will receive about the same number of switches on average. – Bjarke H. Roune Jun 30 '11 at 20:43
  • The RDTSC instruction gets the number of "ticks" since the CPU reset. Therefore it will count the time doing other processes. http://en.wikipedia.org/wiki/Time_Stamp_Counter – Nate Zaugg Jun 30 '11 at 20:48
  • You could use the C clock function instead. It's not so accurate, but if you run enough repetitions then that is not much of a problem. There may also be a more accurate way to record per-process time. – Bjarke H. Roune Jun 30 '11 at 21:30
  • Per-process time tends to be inaccurate because incomplete timeslices (e.g. if the OS receives an interrupt) are still assigned to you. – tc. Jul 02 '11 at 01:37
3

I think what the article tries to say about the EAX register, is that since some operations can only be performed on EAX, it's better to use it from the start. This was very true with the 8086 (MUL comes to mind), but the 386 made the ISA much more orthogonal, so it's much less true these days.
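
A small illustration of the difference (my sketch, in the question's _asm syntax): on the 8086, multiplication was tied to the accumulator, while the 386's two- and three-operand IMUL forms work on any general-purpose register:

    _asm {
        ; 8086-style: MUL implicitly uses AL/AX, result goes to AX (or DX:AX)
        mov al, 7
        mov bl, 6
        mul bl               ; AX = AL * BL

        ; 386+ forms: no need to go through EAX at all
        imul ebx, ecx        ; EBX = EBX * ECX
        imul eax, ebx, 10    ; EAX = EBX * 10
    }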

ninjalj