How do (x86) virtual machines generally handle flags?

Question

For a side project, I am attempting to write a semi-programmable x86 virtual machine.

I understand the formats so most of the design is relatively simple, but after executing an instruction with its operands, flags are often changed. It would be very inefficient to check each potential bit, so I was thinking of popping the flag register into the VM, ANDing it, and then setting the VM's flag register. However, this is still a lot of overhead.

This is borderline opinionated, but is there something I am missing?

Uhm, normally hypervisors don't emulate an instruction per step, but let the guest code run (mostly) unmodified, trapping into the hypervisor kernel when the hardware detects a privileged instruction. Of course in practice it's way more complicated than this, but the point is that generally you need to worry about the state of the flags register only when you trap into the hypervisor, not after each and every instruction. — Matteo Italia, Oct 27 '16 at 19:18
I don't understand what you're doing. If this is a virtual machine, you'd normally let code execute and do its thing until you get to a vmcall, and basically not care about flags. If this is a "naive" emulator and you are reimplementing the x86 instruction set in a higher-level language, then you can be clever and avoid calculating flags in most cases (either by delaying it to later or by checking ahead which flags could be relevant). — zneak, Oct 27 '16 at 20:32
Either way, "popping the flag register into the VM" is an unusual way to phrase things, because you normally pop things out of a stack. It's also unclear to me what could require ANDing; x86 instructions can either set or clear a flag, leave it unmodified, or make it undefined, so the operation should be something like `vm_flags = output_flags | (vm_flags & unmodified_flags_mask)`. — zneak, Oct 27 '16 at 20:35
If you're purely emulating instructions (meaning you're never running instructions natively on the CPU, using VT-x or not), please edit your question to indicate you're building an *emulator*. — Jonathon Reinhart, Oct 27 '16 at 20:44

Alexis Wilke · Accepted Answer · 2016-10-27T20:23:24.510

If you want your emulator to simulate the processor as it is, then yes, you need to emulate the flags exactly.

This means clearing bits that need to be cleared (with AND), setting bits that need to be set (with OR), and copying / calculating bits when required (e.g. the Z flag requires testing whether the result is zero, the carry flag requires knowing whether the unsigned result overflowed, etc.)
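As a rough illustration of that eager approach, here is a minimal sketch in C for an 8-bit `ADD`. The helper name `emu_add8` and the `FLAG_*` constants are made up for this example (the bit positions are the standard EFLAGS ones); a real emulator would need one such routine per operand size and instruction family.

```c
#include <stdint.h>

/* EFLAGS bit positions of the six arithmetic status flags. */
enum {
    FLAG_CF = 1u << 0,
    FLAG_PF = 1u << 2,
    FLAG_AF = 1u << 4,
    FLAG_ZF = 1u << 6,
    FLAG_SF = 1u << 7,
    FLAG_OF = 1u << 11
};
#define ARITH_MASK (FLAG_CF | FLAG_PF | FLAG_AF | FLAG_ZF | FLAG_SF | FLAG_OF)

/* Hypothetical helper: emulate an 8-bit ADD and update the VM's flags word. */
static uint8_t emu_add8(uint8_t a, uint8_t b, uint32_t *flags)
{
    unsigned wide = (unsigned)a + b;      /* keep the 9th bit for the carry */
    uint8_t r = (uint8_t)wide;
    uint32_t f = 0;

    if (wide > 0xFF)                 f |= FLAG_CF;  /* unsigned overflow */
    if (r == 0)                      f |= FLAG_ZF;
    if (r & 0x80)                    f |= FLAG_SF;
    if ((a ^ b ^ r) & 0x10)          f |= FLAG_AF;  /* carry out of bit 3 */
    if (~(a ^ b) & (a ^ r) & 0x80)   f |= FLAG_OF;  /* signed overflow */

    /* PF is set when the low 8 bits contain an even number of 1 bits. */
    uint8_t p = r;
    p ^= p >> 4;
    p ^= p >> 2;
    p ^= p >> 1;
    if (!(p & 1))                    f |= FLAG_PF;

    /* Clear the flags this instruction writes (AND), then set the new ones (OR). */
    *flags = (*flags & ~(uint32_t)ARITH_MASK) | f;
    return r;
}
```

The last statement is the AND/OR merge: bits outside `ARITH_MASK` (the interrupt flag, say) pass through untouched.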

There is no way around it.

This is just like decoding the ModR/M byte. You have no way around having to load that byte, check the mode to determine whether this is a register or a memory access, and act accordingly...

And in effect, that means your emulator will be "much slower" (unless you are emulating an old 10 MHz processor on a modern 3 GHz processor, in which case you have about 300 host cycles for every emulated cycle, so you should be just fine).

If you're interested, I wrote a 6502 emulator and tested it with an Apple 2 ROM. I had to add sleeps to keep it from running at 100 MHz or more... (that processor originally ran at 1 MHz...)

There is a way around; it's called lazy evaluation: save enough info to calc flags, but don't do it yet. x86 writes flags more often than it reads them. — Peter Cordes, Oct 27 '16 at 20:37
@PeterCordes Yeah, I would imagine you could keep the last result in a variable and use it if you need to evaluate the Z, C, V and maybe some other flags too. It may be faster than evaluating the flags each time. It may be complicated in some situations though, like _conflicts_ in calculating C between an `ADC` and a `ROL` instruction, for example. — Alexis Wilke, Oct 27 '16 at 21:04

score 6 · Answer 2 · edited May 23 '17 at 12:30

You appear to be asking about emulating x86, not virtualizing it. Since modern x86 hardware supports virtualization, where the CPU runs guest code natively and only traps to the hypervisor for some privileged instructions, that's what the term "virtualization" normally means.

Lazy flag evaluation is typical. Instead of actually calculating all the flags, just save the operands from the last instruction that set flags. Then if something actually reads the flags, figure out what the flag values need to be.
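A minimal sketch of that idea in C, assuming only 32-bit `ADD` and `SUB` exist. The type `lazy_flags` and all function names here are invented for illustration, not taken from any particular emulator.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of the last flag-writing instruction. */
typedef enum { OP_ADD, OP_SUB } lazy_op;

typedef struct {
    lazy_op  op;
    uint32_t a, b, result;
} lazy_flags;

/* Executing an instruction only records its inputs and result. */
static uint32_t emu_add32(lazy_flags *lf, uint32_t a, uint32_t b)
{
    lf->op = OP_ADD; lf->a = a; lf->b = b; lf->result = a + b;
    return lf->result;
}

static uint32_t emu_sub32(lazy_flags *lf, uint32_t a, uint32_t b)
{
    lf->op = OP_SUB; lf->a = a; lf->b = b; lf->result = a - b;
    return lf->result;
}

/* Flags are derived only when something (e.g. a Jcc) asks for them. */
static bool get_zf(const lazy_flags *lf) { return lf->result == 0; }
static bool get_sf(const lazy_flags *lf) { return (lf->result >> 31) & 1; }

static bool get_cf(const lazy_flags *lf)
{
    switch (lf->op) {
    case OP_ADD: return lf->result < lf->a;  /* sum wrapped around */
    case OP_SUB: return lf->a < lf->b;       /* borrow was needed */
    }
    return false;
}

static bool get_of(const lazy_flags *lf)
{
    uint32_t a = lf->a, b = lf->b, r = lf->result;
    switch (lf->op) {
    case OP_ADD: return ((~(a ^ b) & (a ^ r)) >> 31) & 1;
    case OP_SUB: return (((a ^ b) & (a ^ r)) >> 31) & 1;
    }
    return false;
}
```

A run of arithmetic instructions then costs only a few stores each, and the `switch` work is paid just once, at the conditional branch that actually consumes a flag.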

This means you don't actually have to calculate PF and AF every time they're written (almost every instruction), only every time they're read (mostly only PUSHF or interrupts; hardly any code ever reads PF, except for branches after FP compares, where it means NaN). Computing PF after every integer instruction is expensive in pure C, since it requires a popcount on the low 8 bits of results. (And I think C compilers generally don't manage to recognize that pattern and use `setp` themselves, let alone a `pushf` or `lahf` to store multiple flags, if compiling an x86 emulator to run on an x86 host. They do sometimes recognize population-count patterns and emit `popcnt` instructions, though, when targeting host CPUs that have that feature, e.g. `-march=nehalem`.)

BOCHS uses this technique, and describes the implementation in some detail in the Lazy Flags section of this short pdf: How Bochs Works Under the Hood 2nd edition. They save the result so they can derive ZF, SF, and PF, and the carry-out from the high 2 bits for CF and OF, and from bit 3 for AF. With this, they never need to replay an instruction to compute its flag results.
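To make the "result plus carry-outs" idea concrete, here is a small C sketch. It is not the Bochs code, only an illustration of the scheme as described above; `lazy_state` and the `lazy_*` functions are hypothetical names. The carry-out vector is computed with the usual bitwise identities for addition and subtraction.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t result;   /* result of the last flag-writing instruction */
    uint32_t carries;  /* carry-out (or borrow-out) of every bit position */
} lazy_state;

static uint32_t lazy_add32(lazy_state *s, uint32_t a, uint32_t b)
{
    uint32_t r = a + b;
    s->result  = r;
    s->carries = (a & b) | ((a | b) & ~r);   /* carry out of each bit */
    return r;
}

static uint32_t lazy_sub32(lazy_state *s, uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    s->result  = r;
    s->carries = (~a & b) | ((~a | b) & r);  /* borrow out of each bit */
    return r;
}

/* ZF, SF and PF need only the result. */
static bool lazy_zf(const lazy_state *s) { return s->result == 0; }
static bool lazy_sf(const lazy_state *s) { return (s->result >> 31) & 1; }
static bool lazy_pf(const lazy_state *s)
{
    uint32_t p = s->result & 0xFF;           /* PF looks at the low byte only */
    p ^= p >> 4;
    p ^= p >> 2;
    p ^= p >> 1;
    return !(p & 1);                         /* set for an even number of 1s */
}

/* CF, OF and AF need only three bits of the carry vector. */
static bool lazy_cf(const lazy_state *s) { return (s->carries >> 31) & 1; }
static bool lazy_of(const lazy_state *s)
{
    return ((s->carries >> 31) ^ (s->carries >> 30)) & 1;
}
static bool lazy_af(const lazy_state *s) { return (s->carries >> 3) & 1; }
```

Compared with saving both operands and an opcode tag, this stores two words and needs no per-opcode `switch` when a flag is read.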

There are extra complications from some instructions not writing all the flags (i.e. partial-flag updates), and presumably from instructions like BSF that set ZF based on the input not the output.
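One possible way to cope with a partial update such as `INC` (which writes OF, SF, ZF, AF and PF but leaves CF alone) is to materialise the old CF before overwriting the lazy state. The sketch below is only one assumed design, with made-up names (`flag_state`, `cf_saved`, `emu_inc32`), and it models just CF, ZF and OF.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t result;    /* result of the last flag-writing instruction */
    uint32_t carries;   /* carry-out of every bit position */
    bool     cf_saved;  /* true when CF was preserved by a partial update */
    bool     cf_value;  /* the preserved CF, valid only if cf_saved */
} flag_state;

static bool get_cf(const flag_state *s)
{
    return s->cf_saved ? s->cf_value : ((s->carries >> 31) & 1);
}
static bool get_zf(const flag_state *s) { return s->result == 0; }
static bool get_of(const flag_state *s)
{
    return ((s->carries >> 31) ^ (s->carries >> 30)) & 1;
}

/* ADD writes every arithmetic flag, so the lazy state fully replaces CF. */
static uint32_t emu_add32(flag_state *s, uint32_t a, uint32_t b)
{
    uint32_t r = a + b;
    s->result   = r;
    s->carries  = (a & b) | ((a | b) & ~r);
    s->cf_saved = false;
    return r;
}

/* INC leaves CF unmodified: evaluate the old CF now, then update the rest. */
static uint32_t emu_inc32(flag_state *s, uint32_t a)
{
    bool old_cf = get_cf(s);        /* must be read before the overwrite */
    uint32_t r  = a + 1u;
    s->result   = r;
    s->carries  = (a & 1u) | ((a | 1u) & ~r);
    s->cf_saved = true;
    s->cf_value = old_cf;
    return r;
}
```

The cost is that `INC` forces one flag to be evaluated eagerly, which is part of why partial-flag instructions are awkward for lazy schemes.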

Further reading:

This paper on emulators.com gives a lot of details on how to efficiently save enough state to reconstruct flags. See its section "2.1 Lazy Arithmetic Flags for CPU Emulation".

One of the authors is Darek Mihocka (long time emulator writer, now working at Intel apparently). He has written much interesting stuff about making non-JIT emulators run fast, and CPU performance stuff in general, much of it posted on his site, http://www.emulators.com/. E.g. this article about avoiding branch-misprediction in an emulator's interpreter loop that dispatches to functions that implement each opcode is quite interesting. Darek is also the co-author of that article about BOCHS internals I linked earlier.

A google hit for lazy flag eval may also be relevant: https://silviocesare.wordpress.com/2009/03/08/lazy-eflags-evaluation-and-other-emulator-optimisations/

Last time emulation of x86-like flags came up, the discussion in comments on my lazy-flags answer had some interesting stuff: e.g. @Raymond Chen suggested that link to the Mihocka & Troeger paper, and @amdn pointed out that JIT dynamic translation can produce faster emulation than interpretation.
