Questions tagged [x86]

x86 is an architecture derived from the Intel 8086 CPU. The x86 family includes the 32-bit IA-32 and 64-bit x86-64 architectures, as well as legacy 16-bit architectures. Questions about the latter should be tagged [x86-16] and/or [emu8086]. Use the [x86-64] tag if your question is specific to 64-bit x86-64. For the x86 FPU, use the tag [x87]. For SSE1/2/3/4 / AVX* also use [sse], and any of [avx] / [avx2] / [avx512] that apply.

The x86 family of CPUs contains 16-, 32-, and 64-bit processors from several manufacturers, with backward-compatible instruction sets, going back to the Intel 8086 introduced in 1978.

There is an x86-64 tag for things specific to that architecture, but most of the info here applies to both. It makes more sense to collect everything here. Questions can be tagged with either or both. Questions specific to features only found in the x86-64 architecture, like RIP-relative addressing, clearly belong in x86-64. Questions like "how to speed up this code with vectors or any other tricks" are fine for x86, even if the intention is to compile for 64bit.

Related tags with tag wikis:

sse wiki (some good SIMD guides), and avx (not much there)
inline-assembly wiki for guides specific to interfacing with a compiler that way.
intel-syntax wiki and att wiki have more details about the differences between the two major x86 assembly syntaxes. And for Intel, how to spot which flavour of Intel syntax it is, like NASM vs. MASM/TASM.

Learning resources

Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” has a gentle introduction to x86 asm itself for asm beginners who know C or C++, as well as a very useful guide to looking at compiler output.

If you don't know how to do something in asm, write a simple C function that does it and see what an optimizing compiler does. e.g. int foo(char *p) { return *p; } shows you how to use movsx. See also How to remove "noise" from GCC/clang assembly output?
Short x86 Assembly Guide targeting 32-bit mode and the MASM assembler, but brief and target-agnostic enough to be used as a starting point for any "Intel" syntax dialect assembler (NASM, YASM, FASM, ...).
Suggestions on how to learn asm, with a recommendation against 16bit DOS. Questions should use the x86-16, emu8086, and/or dos tags if applicable, as well as x86 (which includes all platforms.)
To learn assembly - should I start with 32 bit or 64 bit?
OSdev.org: a great resource if you want to understand / modify OS internals or make your own toy OS. Not useful for writing / debugging normal programs that run under existing OSes.
General Tips for Bootloader Development. (Using legacy BIOS, not UEFI).
Working example of a legacy BIOS int 10h bootloader that loads a "kernel" and calls a C main function in it, in 32-bit protected mode. Includes instructions on how to build and link it with NASM, gcc -m32, and ld (with a linker script). And how to make a disk image and run it on QEMU.
the inline-assembly tag wiki. (But see also https://gcc.gnu.org/wiki/DontUseInlineAsm - inline asm is more complicated than writing stand-alone asm functions you call from C, so it's not good for learning asm.)
Using GNU C/C++ inline ASM. The bottom of that answer has a collection of links to info on how to write inline asm that's efficient and correct. The first part of the answer explains why it's not a good way to learn asm in the first place. Don't try to "get your feet wet" with asm by using inline asm. You have to understand everything to write correct input/output operand constraints and clobbers.
Understanding Carry vs. Overflow conditions/flags, normally relevant for unsigned vs. signed respectively.
Style guide: indenting columns for labels / instructions / operands / comments: a Code Review.SE answer: https://codereview.stackexchange.com/questions/204902/checking-if-a-number-is-prime-in-nasm-win64-assembly/204965#204965
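The carry vs. overflow distinction mentioned above can be modelled in C. This is a minimal sketch, not code from any of the linked answers: the helper names are hypothetical, and the functions only compute what CF and OF would be after an 8-bit add such as add al, bl.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers: they model the flags an 8-bit `add` would set. */

/* CF: the unsigned result did not fit in 8 bits. */
bool add8_carry(uint8_t a, uint8_t b)
{
    unsigned wide = (unsigned)a + (unsigned)b;  /* sum without truncation */
    return wide > 0xFF;
}

/* OF: the signed result did not fit in 8 bits. That can only happen when
   both inputs have the same sign bit and the truncated sum has the other. */
bool add8_overflow(uint8_t a, uint8_t b)
{
    uint8_t sum = (uint8_t)(a + b);
    return ((a ^ sum) & (b ^ sum) & 0x80) != 0;
}
```

For example, 0xFF + 0x01 carries but does not overflow (as signed values it is -1 + 1), while 0x7F + 0x01 overflows but does not carry (127 + 1 wraps to -128).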

Quick guide to what's different in x86-64. AT&T syntax. NASM and YASM behave differently (from each other) in choice of encoding for mov rax, 1, and don't use a separate movabs mnemonic for the 64bit-immediate form.
Introduction to x64 Assembly (PDF published by Intel). Uses MASM syntax. Spends a bit of time talking about the Windows calling convention and MSVC-specific toolchain issues (like no MSVC inline asm in 64-bit mode), as you might expect from using "x64" in the article title instead of x86-64. But looks like some good generally-applicable stuff that isn't OS-specific. For some bizarre reason, it suggests using the slow LOOP instruction, so it's not perfect.
A NASM tutorial for x86-64 Linux (nasm -felf64) and MacOS (nasm -fmacho64). Includes some basic SIMD stuff, but forgets to use alignas(16) on the C arrays that require alignment, and uses movaps with integer, movdqa with float. (Which is not a correctness problem, and on most CPUs probably not a performance problem, but is backwards.) Otherwise mostly looks good.
Encoding Real x86 Instructions: a tutorial (course material) on how instructions are encoded into machine code. Lots of diagrams.
x86 on Wikipedia
x86 Assembly wikibook
Assembly Language for x86 Processors (website for Kip Irvine's book)
Programming from the Ground Up, a free (GFDL) book by Jonathan Bartlett. Errata for the book. Available as a small (1MB) PDF from the "download" link on that page, or as HTML chapters. It uses 32-bit x86 asm with AT&T syntax on Linux, and has some good stuff about how to "think like a computer" to figure out how to get things done in asm. It covers some essential operating-system background, such as virtual memory, that is necessary to understand what's going on, as well as assembly / machine language itself.
x86-64 Assembly Language Programming with Ubuntu, a free book using YASM (NASM syntax) for GNU/Linux. The PDF is CC-BY-NC-SA. Unfortunately no mention of default rel or [rel x] RIP-relative addressing so it's missing some stuff that's essential in practice. But does have some introductory stuff about basics like data representation, bits and bytes in memory vs. registers, and other background beyond just what each instruction does.
8086 assembler tutorial for beginners - emu8086 (MASM/TASM style) 16-bit only, but starts out with some nice intro stuff about hex vs. decimal, what assembly language is, what registers are and how memory is addressed, and how to look at memory in the debugger, before jumping into how specific instructions work.
Assembly tutorial - Dr. Paul Carter
Windows Assembly Programming Tutorial
Why do functions have to save some registers, but not others? See below for links to guides & docs for specific calling conventions.
How to trace what a function does: figure out the inputs and the outputs, then figure out what it does with them.
Linux x86 Program Start Up or - How the heck do we get to main()
A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux
What do the register-names like esi mean, and what special purposes do they have. They're all acronyms, like Counter register, or Source Index.

Guides for performance tuning / optimisation:

Agner Fog's optimization guides and resources. Includes latency/throughput tables for P5 onwards. Also much qualitative discussion of how to go about making your code faster. Also has a good guide to the different calling conventions across OSes, and covers linking / symbols / relocation.
Intel's Sandybridge microarchitecture family can't micro-fuse indexed addressing modes in the out-of-order core, only in the decoders and uop-cache. Also: Haswell's dedicated store-address unit on port7 only works with simple effective addresses. Complex effective addresses need the AGU on a load port.
Enhanced REP MOVSB for memcpy: single-threaded bandwidth vs. aggregate bandwidth on desktop vs. many-core CPUs, RFO vs. non-RFO stores. (Modern CPUs have more DRAM / L3 bandwidth than a single core can use; there are other bottlenecks especially in many-core chips).
What Every Programmer Should Know About Memory by Ulrich Drepper. (Originally posted as a series of LWN articles, Ulrich published the PDF later). How DRAM and caches work, their behaviour, and how to optimize software for cache locality. Includes some charts with real microbenchmark data to illustrate points, and a cache-blocked SSE2 matrix multiply example. See a 2017 review of what's outdated, e.g. the P4 software prefetch stuff is mostly obsolete.
Why xor same,same is better than mov reg, 0 for zeroing a register: there are several reasons, some simple and some subtle (e.g. avoiding partial-register stalls on P6/SnB family).
Serializing RDTSC with LFENCE vs. CPUID for benchmarking short sequences within a program.
How to get the CPU cycle count in x86_64 from C++? (including a bunch of info on what rdtsc measures, exactly, and caveats for using it, with links to even more details).
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?: intro to static performance analysis.
Intel's IACA (Intel Architecture Code Analyzer): analyze marked sections of code for throughput (e.g. cycles per iteration) or latency of the critical path. Assumes perfect cache, and other simplifications, and isn't always correct, but can be useful. Was stalled, but updated again for Skylake-X (AVX512). See What is IACA and how do I use it? for a tutorial.
uiCA (uops.info Code Analyzer) is like IACA but with an accurate model of the front-end fetch/pre-decode/decode (and uop cache or LSD if applicable, I assume) not just 4-wide or 5-wide issue that IACA assumes. See Do 32-bit and 64-bit registers cause differences in CPU micro architecture? for an example output graph.
Haswell microarchitecture, Bulldozer microarchitecture. David Kanter's analysis. He's also done writeups on earlier uarches, like Sandybridge and Nehalem.
Modern Microprocessors A 90-Minute Guide!: from in-order pipelined to super-scalar out-of-order. And brainiac (PPro) vs. speed demon (Pentium 4), and Pentium 4 hitting the "power wall" in CPU design.
A whirlwind introduction to dataflow graphs: how to analyze dependency chains for throughput and latency.
http://www.uops.info/ very detailed uop / execution port testing on Intel CPUs, finding some things that repeating a large block of the same instruction (like Agner Fog's testing) sometimes misses.
New CPUs will usually have AIDA64 InstLatx64 results before Agner Fog can test and publish updated tables. For example, Skylake-avx512, and see also https://github.com/InstLatx64/InstLatx64 for a mirror + a spreadsheet of Skylake-AVX512 port assignments (compiled from IACA-2.3 output). BDW vs. SKL points out some of the interesting changes in SKL (more throughput for more instructions, different FP latency).
2015 IDF slides from the Skylake power management talk. Unfortunately the main site (http://myeventagenda.com/sessions/0B9F4191-1C29-408A-8B61-65D7520025A8/7/5) which had video (of slides + audio) is offline now.

Instruction set / asm syntax references:

Intel's vector intrinsics finder/search (very good): search by asm mnemonic or C intrinsic name
x86/x64 SIMD Instruction List (SSE to AVX512) Beta: A nice compact table listing instruction mnemonics and their intrinsics, broken down by type and element-size. Detailed pages with graphical data-movement diagrams for each instruction.
SIMD guides in the SSE tag wiki, focusing on how to actually make good use of SIMD in general, not just what the available instructions are.
Intel's manuals, including instruction set reference manual. Extremely detailed description of everything every instruction does to the architectural state. Big, but has a decent index / table of contents. Also on that page: Intel's optimization manual. Some of the same advice as Agner Fog's guides, but sometimes without explaining exactly why in terms of microarch execution ports and other under-the-hood reasons. Also sometimes obsolete, for example recommending against inc/dec long after P4 is irrelevant.
AMD's x86 manuals, including instruction-set reference and optimization manuals.
HTML version of Intel's insn set reference, auto-generated from the PDF. One page per instruction, great for linking in answers.
Another HTML extract, including AVX512, CLFLUSHOPT, etc. This makes it more cluttered, and harder to find what you need, if you're not targeting AVX512. (But note that CLFLUSH has changed to being strongly-ordered, but felixcloutier.com's HTML extract still has the old documentation. There may be other inaccuracies in the old docs, even for old instructions.)
https://sandpile.org - CPUID maps, instruction encoding, register diagrams, opcode map, miscellaneous other technical details.
x86 Instruction Reference including when introduced (8086, 186, 586, etc) - NASM appendix B. Includes undocumented instructions, and Cyrix-only MMX instructions, and stuff like that.

A fork of an older version includes English descriptions. The original had some errors in which generation introduced each form of each insn but this version keeps the nice formatting while fixing those. Handy for people still developing for x86-16. The similar wikipedia page doesn't mention that 386 is required for the faster 2-operand form of imul r16, r/m16 that doesn't have to calculate the upper half of the result.
x86 Opcode reference guide, sorted by opcode or by mnemonic. 32, 64, or both in one table. The "geek" version includes non-standard / undocumented opcodes, the "coder" one includes columns showing which if any flags are read and written.
Original 8086 errata / anomalies, such as mov ss, src not properly disabling interrupts until the end of the next instruction. Also see the parent directory for some errata, undocumented instructions, and stuff for 186/286/386.
Simply FPU: x87 tutorial. Helpful for understanding old x87 code, esp. the early sections about how the register stack works. (Use SSE for new code.)
fsin's precision is far worse than 1ulp for inputs close to pi, contrary to Intel's previous documentation. The other FP articles in Bruce Dawson's series are also excellent (index in this one on FP comparisons).
GNU as manual, aka gas manual
The NASM manual
YASM manual: describes YASM syntax and macros. Excellent register diagram showing partial registers, with their machine-code encodings, and a reminder on zero-extending vs. unmodified upper parts. (Another simpler register-subset diagram for a single reg).

Possible canonical duplicates for register subsets: Assembly registers in 64-bit architecture includes some calling-convention / usage stuff. How do AX, AH, AL map onto EAX? is a good one for bugs where AL and RAX were used for different things, corrupting each other.
MASM Reference Documentation, and an old MASM 6.1 manual from 1996. Confusing brackets in MASM32 shows that MASM surprisingly ignores brackets around symbolic immediates.
MASM syntax as used by JWasm. JWasm is a portable assembler.
FASM manual
table of AT&T(GNU) vs. NASM syntax for addressing modes and indirect jmp/call
All the available addressing modes (32/64-bit) (Intel syntax, with a note about NASM vs. MASM for mov reg, symbol), with links to further guides.
AT&T addressing-mode syntax
16-bit addressing modes.
TODO: find a good link for AMD's XOP instruction set. (Not recommended for general use; even AMD is dropping XOP support in their Zen architecture.)
Cheat sheet PDF
Win32-specific cheat sheet

OS-specific stuff: ABIs and system-call tables:

x86 ABIs (wikipedia): calling conventions for functions, including x86-64 Windows and System V (Linux). See also Agner Fog's nice calling convention guide
32-bit absolute addresses no longer allowed in x86-64 Linux? (PIE executables are now the default on most distros, with gcc configured with --enable-default-pie.)
Mach-O 64-bit format does not support 32-bit absolute addresses. NASM Accessing Array (OS X's image base is above the low 32, unlike Linux position-dependent executables). Also mentions 2 known bugs in some NASM versions with macho64 and RIP-relative or 64-bit absolute addressing.

System V ABI summary on osdev: i386 and x86-64, with links to random copies of the per-architecture supplement for various architectures, and the generic gABI that all the processor-specific supplement (psABI) documents expand on.
System V psABI official standard current revisions for x86-64 and i386 (wiki page on github, kept up to date by H.J. Lu). Direct link to x86-64 revision 1.0. Also links to the official forum for ABI discussion by maintainers/contributors.
clang/gcc sign/zero extend narrow args to 32bit, even though the System V ABI as written doesn't (yet?) require it. Clang-generated code also depends on it.
System V 32bit (i386) psABI (official standard, rev 1.1 Dec2015), used by Linux and Unix. (Some OSes don't require 16-byte stack alignment for 32-bit code; GNU/Linux does)
(Historical: very old SCO version of the i386 SysV ABI, before 16B stack alignment was required).

OS X 32bit x86 calling convention, with links to the others. The 64bit calling convention is System V. Apple's site just links to a FreeBSD pdf for that.

Windows x86-64 __fastcall calling convention
Windows __vectorcall: documents the 32bit and 64bit versions
Windows 32bit __stdcall: used to call Win32 API functions. That page links to the other calling convention docs (e.g. __cdecl).
ABI cheat sheet: x86 vs. x64 vectorcall and non-vectorcall, vs. SysV. SysV section is incomplete.
Why does Windows64 use a different calling convention from all other OSes on x86-64?: some interesting history, esp. for the SysV ABI where the mailing list archives are public and go back before AMD's release of first silicon.
MSVC's 32bit CRT startup code sets the x87 FPU precision to 53 (double). That entire series of articles (table of contents in this one) is excellent, including asm output from MSVC in some examples.

The Definitive Guide to Linux System Calls (on x86). Examples of how to use int 0x80, 32-bit sysenter, and 64-bit syscall, and how to call through the vDSO for gettimeofday, and has some info about glibc's syscall wrappers. Lots of details, and also some background info / basics for beginners.
Linux system call tables. 64bit syscall numbers, with parameter->register mapping (derived from the kernel source code, and the standard rule for order of args).
FreeBSD system calls: question has FreeBSD syscalls, answer has Linux and others.
What are the calling conventions for UNIX & Linux system calls (and user-space functions) on i386 and x86-64: Note that 32bit int 0x80 restores all registers (including flags) except eax, while 64bit syscall also clobbers rcx and r11 as well as putting the return value in rax.
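From C, the same kernel entry points can be reached by number without writing any asm. The following is a minimal Linux-only sketch, assuming glibc's generic syscall() wrapper and the SYS_* constants from <sys/syscall.h>; the function names raw_getpid and raw_write are hypothetical. The wrapper takes care of placing the call number and arguments in the registers the kernel ABI expects.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* makes sure syscall() is declared */
#endif
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrappers, Linux-only: call the kernel by system-call number
   instead of going through the usual libc functions. */

/* Same result as getpid(), requested by number. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}

/* Same as write(fd, buf, len). On failure the wrapper returns -1 and sets
   errno, unlike the raw instruction, which returns a negative errno value. */
long raw_write(int fd, const void *buf, size_t len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

Running a program that uses these under strace is a convenient way to confirm which system calls are actually made.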

16bit interrupt list: PC BIOS system calls (int 10h / int 16h / etc, AH=callnumber), DOS system calls (int 21h/AH=callnumber), and more.

memory ordering:

Weak vs. Strong Memory Models: what it means when people say x86 has a "strongly ordered memory model". See also the c++ info page for many good links if you're using C11/C++11 atomics.
Memory Reordering Caught in the Act: A test case that demonstrates memory reordering in practice on a multicore x86 CPU.
A better x86 memory model: x86-TSO (extended version) A formal definition of the x86 memory model which hopefully matches how real hardware behaves.
Why isn't add dword [num], 1 atomic, even though it's a single instruction? Also asks about compiling num++ in C++. See also Atomicity on x86: What does it mean for a load or store to be atomic, and how is it implemented internally?
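In C, the portable way to ask for an atomic read-modify-write is C11 <stdatomic.h>. The sketch below is illustrative only (the counter and function names are hypothetical). On x86, compilers typically implement such an increment with a lock-prefixed instruction such as lock add or lock xadd, which is what a plain add dword [num], 1 lacks.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared counter, zero-initialized. */
static atomic_uint_fast32_t hits;

/* Atomic increment; returns the value the counter held before the add.
   Relaxed ordering: atomicity only, no ordering of other memory accesses. */
uint_fast32_t count_hit(void)
{
    return atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
}

/* Atomic load of the current count. */
uint_fast32_t hits_so_far(void)
{
    return atomic_load_explicit(&hits, memory_order_relaxed);
}
```

Compiling this with optimization and reading the asm output is a quick way to see which instruction your compiler chooses.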

Specific behaviour of specific implementations

TLB and Pagewalk Coherence in x86 Processors. Many x86 microarchitectures, especially Intel's, provide stronger ordering guarantees than the ISA requires for modifying a page-table entry that's not already cached in the TLB. Win95 even depended on this. (Don't write new code that depends on this.)
Measuring Reorder Buffer Capacity Another experimental test that demonstrates the capabilities and limits of out-of-order execution in real hardware.
What are the exhaustion characteristics of RDRAND on Ivy Bridge? With an answer from David Johnston (Intel RNG HW designer and librdrand author).

Q&As with good links, or directly useful answers:

Using GNU C/C++ inline ASM. (Same link from the learning-resources section, but worth repeating here.)
What are the best instruction sequences to generate vector constants on the fly?
Parallel programming using Haswell architecture
Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs. Has a long answer including some introductory computer-architecture stuff as well as details of what can stall a Haswell pipeline.
INC instruction vs ADD 1: Does it matter?
How can I run this assembly code on OS X?: OS X getting-started guide. (Symbol names are prepended with _ on OS X, unlike for Linux ELF systems.)
add/sub/LEA can be used with garbage in high bits, so LEA eax, [rdi + rsi*2 - 15] to compute a + 2*b - 15 works fine, even if a and b are only supposed to be 8 or 16 bits.
TODO: find a question about how to use a profiler to measure uops and stuff. perf comes with most Linux distros, and ocperf.py is a wrapper for it that provides more symbolic names for stuff like micro-arch-specific uop counters.
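The point in the list above about add/sub/LEA tolerating garbage in the high bits can be checked with ordinary C arithmetic. This is a sketch under stated assumptions: the function name is hypothetical, and the 64-bit parameters stand in for rdi and rsi, whose upper bits may hold anything.

```c
#include <stdint.h>

/* Models `lea eax, [rdi + rsi*2 - 15]` followed by using only AL.
   Only the low 8 bits of each input are meaningful; carries in an
   add/shift only travel upward, so high garbage cannot reach the low bits. */
uint8_t low8_of_a_plus_2b_minus_15(uint64_t a_reg, uint64_t b_reg)
{
    uint32_t eax = (uint32_t)(a_reg + b_reg * 2 - 15);  /* truncated to 32 bits */
    return (uint8_t)eax;                                /* keep AL only */
}
```

Calling it with clean inputs (200, 100) and with the same low bytes under arbitrary high bits gives the same low byte. This does not hold for operations where high bits influence low bits, such as right shifts or division.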

FAQs / canonical answers:

If you have a problem involving one of these issues, don't ask a new question until you've read and understood the relevant Q&A.

(TODO: find better question links for these. Ideally questions that make a good duplicate target for new dups. Also, expand this.)

My program crashes / segfaults: You need to use a debugger to find what instruction is crashing (see the bottom of this tag wiki for GDB and Visual Studio tips). Most buggy asm programs crash, so without more info this is not useful. Reasons can include clobbering registers or stack memory you shouldn't have, leaving esp pointing to the wrong place before a ret, or many many other reasons besides the following other common problems.
external assembly file in visual studio - VS mixed-source x64 project, for asm files as part of a C/C++ program.
Also Assembly programming - WinAsm vs Visual Studio 2017 for a pure asm project.
Building 32bit code on a 64bit system (with the GNU toolchain). gcc example.s makes a binary that runs in 64bit mode, which will crash if the code was written for 32bit mode. Related: What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?.
Building an executable from asm source that defines _start vs. source that defines main, with gcc/as/ld and/or NASM. With or without libc, and static vs. dynamic executable.
Wide load on narrow data loading or modifying extra bytes, e.g. mov eax, [var] from a db 0.
ret from _start segfaults without making a Linux _exit syscall. ret doesn't work because it's not a function. What happens if there is no exit system call in an assembly program? also covers the case of falling off the end with no ret.

Execution just keeps going if there's no jump or ret, falling through to what's next: What if there is no return statement in a CALLed block of code in assembly programs and Why is no value returned if a function does not explicity use 'ret'.
Code executes condition wrong? fall through from the if into the else body in an if/else. Nicely explains that labels aren't magic and execution falls through them.
Segmentation fault when using DB (define byte) inside a function Putting data where it's executed as code. (Assembly (x86): <label> db 'string',0 does not get executed unless there's a jump instruction for legacy BIOS bootloaders with data at the top.)
idiv / div problems: Zero edx first, or sign-extend eax into it. 32-bit div faults with #DE if the 64b/32b => 32b quotient doesn't actually fit in 32b. (On POSIX systems including Linux, this raises SIGFPE).

8-bit operand size like div dl is the special case where dx isn't involved, just AX and AH/AL. It still faults if the quotient overflows 8 bits.
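The fault condition for unsigned 32-bit div can be stated precisely: the quotient fits in 32 bits exactly when the high half of the dividend (EDX) is smaller than the divisor. A minimal C model, with a hypothetical function name, assuming the 64-bit argument represents EDX:EAX:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of unsigned `div r/m32`: returns true when the
   instruction would raise #DE for the given EDX:EAX and divisor. */
bool div32_would_fault(uint64_t edx_eax, uint32_t divisor)
{
    if (divisor == 0)
        return true;                      /* divide by zero */
    uint32_t edx = (uint32_t)(edx_eax >> 32);
    return edx >= divisor;                /* quotient would need more than 32 bits */
}
```

This shows why forgetting to zero edx is such a common bug: any stale value in edx that is at least as large as the divisor makes the instruction fault, even though eax alone would have divided cleanly.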
No output from printf when I pipe the output, or print something without a newline? When you use the exit system call.
Calling printf in x86_64 using GNU assembler calling convention, stack alignment, and working example. Related NASM-syntax version Segfault while calling C function (printf) from Assembly

Canonical duplicate for scanf segfaulting on misaligned stack in modern Linux builds of glibc: glibc scanf Segmentation faults when called from a function that doesn't align RSP
Library functions modify registers / which registers do my functions need to save and restore? This is specified by the calling convention (part of the ABI) for the platform you're targeting. Search for those terms on this page. What registers must be preserved by an x86 function? is a decent canonical duplicate.
mismatched push/pop: if the stack pointer isn't pointing at the return address when you ret, you crash.
How do I handle multi-digit numbers? Linux, Windows, OS X, and DOS system calls for handling user input/output give you ASCII (or UTF-8) characters, or strings of characters. (Canonical Q&A for single-digit failure to do sub al, '0'). You normally need to convert between strings and binary integers to do math on them, like the C functions atoi or sprintf(buf, "%d", number). None of the common system-call APIs for major OSes that run on x86 provide these functions for you; only as libraries.

string-to-integer (32-bit NASM, algorithm works everywhere). (multiply by 10 for place value) Also includes an int-to-string loop.
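The multiply-by-10 algorithm is short enough to sketch in C before translating it to asm. This is an illustration with a hypothetical function name, not the code from the linked answer; it handles unsigned input only and does no overflow checking.

```c
#include <stdint.h>

/* Converts leading ASCII decimal digits to a binary integer.
   Stops at the first non-digit (e.g. the newline left by a read call).
   No sign handling and no overflow check. */
uint32_t parse_decimal(const char *s)
{
    uint32_t total = 0;
    while (*s >= '0' && *s <= '9') {
        /* shift previous digits up one decimal place, then add the new one */
        total = total * 10 + (uint32_t)(*s - '0');
        s++;
    }
    return total;
}
```

The subtraction of '0' is the same step as sub al, '0' in asm: it turns the ASCII code of a digit into its numeric value.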

Printing integers: 16-bit code to print 16 or 32-bit integers (in dx:ax) (1 digit at a time with MS-DOS int 21h, but could be adapted to store into a string or use a different output method.) Another example for unsigned 16b numbers in DOS that calculates digits and stores them into a string in memory.

2-digit decimal numbers (00-99), using BIOS int 10h for each digit: Displaying Time in Assembly. (Just a special case of the general algorithm, not looping.)

NASM x86-64 function to convert and print a 32-bit unsigned integer (using a single Linux write system call on a buffer). Other answers on the same question show printing one character at a time. AT&T version of the same function, also showing a 5x faster version that uses a multiplicative inverse instead of div to divide by the compile-time constant 10.
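The reverse direction, integer to decimal string, is repeated division by 10. The remainders come out least-significant digit first, so they have to be stored backwards or reversed. A minimal C sketch (hypothetical function name; the caller is assumed to provide at least 11 bytes):

```c
#include <stddef.h>
#include <stdint.h>

/* Writes the decimal digits of value into buf, NUL-terminated, and returns
   the number of digits. buf must hold at least 11 bytes (10 digits + NUL). */
size_t format_decimal(uint32_t value, char *buf)
{
    char tmp[10];                 /* 4294967295 has 10 digits */
    size_t n = 0;

    do {                          /* do/while so that 0 prints as "0" */
        tmp[n++] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);

    for (size_t i = 0; i < n; i++)   /* digits were produced in reverse */
        buf[i] = tmp[n - 1 - i];
    buf[n] = '\0';
    return n;
}
```

An asm version would typically fill the buffer from the end toward the start instead of reversing, then pass the resulting pointer and length to a single write call.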

How to convert a binary integer number to a hex string? (32-bit NASM code. Scalar, SSE2, SSSE3, AVX512F, and AVX512VBMI versions.)
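Hex conversion needs no division because each hex digit is exactly 4 bits. A scalar C sketch of the shift-and-mask approach, using a hypothetical function name and a lookup table (the SIMD versions in the linked answer are not reproduced here):

```c
#include <stdint.h>

/* Writes 8 lowercase hex digits plus a terminating NUL into out. */
void format_hex32(uint32_t value, char out[9])
{
    static const char digits[] = "0123456789abcdef";

    for (int i = 0; i < 8; i++) {
        /* most-significant nibble first: shift counts 28, 24, ..., 0 */
        unsigned nibble = (value >> (28 - 4 * i)) & 0xF;
        out[i] = digits[nibble];
    }
    out[8] = '\0';
}
```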
Loading pointers into registers vs. loading data into registers: Make sure you understand the difference between mov reg, symbol and mov reg, [symbol] (NASM syntax), or in MASM syntax: mov reg, OFFSET symbol vs. mov reg, symbol. Many beginner questions are caused by mistakes in dereferencing addresses, or not dereferencing. This is the same as pointers in C.
Invalid combination of opcode and operands error on mov [msg], [ebp+8]? You can't use two memory operands to one instruction. (Why IA32 does not allow memory to memory mov?)
Bit-shifts and rotates need the count in cl, not any other register, or as an immediate constant. shl eax, ebx is impossible, shl eax, 2 is fine, and so is shl eax, cl
Call an absolute pointer in x86 machine code or jmp to an absolute address. With examples in NASM and AT&T syntax.
Why do most x86-64 instructions zero the upper part of a 32 bit register? In fact, all instructions that write a 32-bit register zero the upper 32 bits of the full 64-bit register, so mov eax, 1234 is equivalent to mov rax, 1234 but more efficient. This is not the case for writing to 8- and 16-bit registers, like al/ah/ax, so you need movzx or movsx if the upper bits might hold garbage and you need to clear them (e.g. before using as part of a memory address).
Using LEA on values that aren't addresses / pointers? It's just a shift-and-add ALU instruction that uses memory-operand syntax and machine encoding.
How to tell the length of an x86 instruction? – with an overview over the x86 instruction encoding
Reversing a string? This well-commented answer uses 16-bit ms-dos system calls to read the string, but the actual loop over the string works the same for 32 or 64-bit code.
Indexing an array without scaling the index by the element width, resulting in overlapping loads or stores. Declaring and indexing an integer array of qwords in assembly (x86-64 AT&T syntax)
boot loader works in QEMU but not on real hardware – real computers sometimes expect the MBR to have a BPB (BIOS parameter block). If the BPB is missing or wrong, the BPB area in the MBR is overwritten with “correct” values, corrupting your boot loader.
How do I do X in assembly: usually the same way you would in another programming language, like C. Figure out what needs to happen to the data before you get bogged down in writing instructions to make it happen.
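As an example of working out the data movement in C first, consider the array-indexing item in the list above. C scales the index by the element size for you; asm does not. The sketch below (hypothetical function names) spells out the byte arithmetic explicitly, so the correct version corresponds to an address like [base + idx*8] and the buggy one to [base + idx].

```c
#include <stdint.h>
#include <string.h>

/* Correct: byte offset = index * element size. */
uint64_t load_qword_scaled(const uint64_t *base, size_t idx)
{
    uint64_t v;
    memcpy(&v, (const unsigned char *)base + idx * sizeof(uint64_t), sizeof v);
    return v;
}

/* Buggy on purpose: uses the index as a byte offset, so the 8-byte load
   overlaps two neighbouring elements. The caller must make sure the
   8 bytes starting at that offset are still inside the array. */
uint64_t load_qword_unscaled(const uint64_t *base, size_t idx)
{
    uint64_t v;
    memcpy(&v, (const unsigned char *)base + idx, sizeof v);
    return v;
}
```

With an array whose elements have distinct byte patterns, the unscaled load returns a value that is a mix of two elements, which is the typical symptom in questions about this bug.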

How to get started / Debugging tools + guides

Find a debugger that will let you single-step through your code, and display registers while that happens. This is essential. We get many questions on here that are something like "why doesn't this code work" that could have been solved with a debugger.

On Windows, Visual Studio has a built-in debugger. See Debugging ASM with Visual Studio - Register content will not display. And see Assembly programming - WinAsm vs Visual Studio 2017 for a walk-through of setting up a Visual Studio project for a MASM 32-bit or 64-bit Hello World console application.

On Linux: A widely-available debugger is gdb. See Debugging assembly for some basic stuff about using it on Linux. Also How can one see content of stack with GDB?

There are various GDB front-ends, including GDBgui. Also guides for vanilla GDB:

With layout asm and layout reg enabled, GDB will highlight which registers changed since the last stop. Use stepi to single-step by instructions. Use x to examine memory at a given address (useful when trying to figure out why your code crashed while trying to read or write at a given address). In a binary without symbols (or even sections), you can use starti instead of run to stop before the first instruction. (On older GDB without starti, you can use b *0 as a hack to get gdb to stop on an error.) Use help x or whatever for help on any command.

GNU tools have an Intel-syntax mode that's similar to MASM, which is nice to read but is rarely used for hand-written source (NASM/YASM is nice for that if you want to stick with open-source tools but avoid AT&T syntax):

clang or gcc -Wall -O3 -masm=intel foo.c -fverbose-asm -S -o- | less (affects inline-asm)
GDB: set disassembly-flavor intel (can go in your ~/.gdbinit)
objdump -drwC -Mintel
perf report -Mintel

Another key tool for debugging is tracing system calls. e.g. on a Unix system, strace ./a.out will show you the args and return values of all the system calls your code makes. It knows how to decode the args into symbolic values like O_RDWR, so it's much more convenient (and likely to catch brain-farts or wrong values for constants) than using a debugger to look at registers before/after an int or syscall instruction. Note that it doesn't work correctly on Linux int 0x80 32-bit ABI system calls in 64-bit processes: What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?.

To debug boot or kernel code, boot it in Bochs, qemu, or maybe even DOSBox, or any other virtual machine / simulator / emulator. Use the debugging facilities of the VM to get way better information than the usual "it locks up" you will experience with buggy privileged code.

Bochs is generally recommended for debugging real-mode bootloaders, especially ones that switch to protected mode; Bochs's built-in debugger understands segmentation (unlike GDB), and can parse a GDT, IDT, and page tables to make sure you got the fields right.

For DOS programs, see the x86-16 tag wiki for debuggers that run inside the guest, and thus can debug a specific DOS program maybe more easily than Bochs for the whole system.

REPL (Read Eval Print Loop) environments let you type an instruction and see what it does to register values. They are probably only useful for user-space code, not osdev.

16952 questions

2428 votes, 10 answers

Why are elementwise additions much faster in separate loops than in a combined loop?

Suppose a1, b1, c1, and d1 point to heap memory, and my numerical code has the following core loop. const int n = 100000; for (int j = 0; j < n; j++) { a1[j] += b1[j]; c1[j] += d1[j]; } This loop is executed 10,000 times via another outer…

c++ performance x86 vectorization compiler-optimization

asked Dec 17 '11 at 20:40

Johannes Gerer

1619 votes, 11 answers

Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations with _mm_popcnt_u64 on Intel CPUs

I was looking for the fastest way to popcount large arrays of data. I encountered a very weird effect: Changing the loop variable from unsigned to uint64_t made the performance drop by 50% on my PC. The Benchmark #include #include…

c++ performance assembly x86 compiler-optimization

asked Aug 01 '14 at 10:33

gexicide

933 votes, 11 answers

Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly?

I wrote these two solutions for Project Euler Q14, in assembly and in C++. They implement identical brute force approach for testing the Collatz conjecture. The assembly solution was assembled with: nasm -felf64 p14.asm && gcc p14.o -o p14 The C++…

c++ performance assembly optimization x86

asked Nov 01 '16 at 06:12

rosghub

859 votes, 17 answers

What's the purpose of the LEA instruction?

For me, it just seems like a funky MOV. What's its purpose and when should I use it?

assembly x86 x86-64 x86-16

asked Nov 01 '09 at 20:57

user200557

370 votes, 16 answers

How can I determine if a .NET assembly was built for x86 or x64?

I've got an arbitrary list of .NET assemblies. I need to programmatically check if each DLL was built for x86 (as opposed to x64 or Any CPU). Is this possible?

.net assemblies x86 64-bit x86-64

asked Nov 06 '08 at 22:14

Judah Gabriel Himango

344 votes, 4 answers

Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs

I've been racking my brain for a week trying to complete this assignment and I'm hoping someone here can lead me toward the right path. Let me start with the instructor's instructions: Your assignment is the opposite of our first lab assignment,…

c++ optimization x86 intel cpu-architecture

asked May 21 '16 at 09:29

Cowmoogun

317 votes, 12 answers

How to compile Tensorflow with SSE4.2 and AVX instructions?

This is the message received from running a script to check if Tensorflow is working: I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcublas.so.8.0 locally I tensorflow/stream_executor/dso_loader.cc:125]…

tensorflow x86 compiler-optimization simd compiler-options

asked Dec 22 '16 at 23:21

GabrielChu

308 votes, 11 answers

What does multicore assembly language look like?

Once upon a time, to write x86 assembler, for example, you would have instructions stating "load the EDX register with the value 5", "increment the EDX" register, etc. With modern CPUs that have 4 cores (or even more), at the machine code level does…

assembly x86 cpu multicore smp

asked Jun 11 '09 at 13:16

Paul Hollingsworth

306 votes, 4 answers

How to run a program without an operating system?

How do you run a program all by itself without an operating system running? Can you create assembly programs that the computer can load and run at startup, e.g. boot the computer from a flash drive and it runs the program that is on the CPU?

assembly x86 operating-system bootloader osdev

asked Feb 26 '14 at 22:13

user2320609

280 votes, 6 answers

What is exactly the base pointer and stack pointer? To what do they point?

Using this example coming from wikipedia, in which DrawSquare() calls DrawLine(), (Note that this diagram has high addresses at the bottom and low addresses at the top.) Could anyone explain me what ebp and esp are in this context? From what I see,…

c assembly x86 stack-frame stack-pointer

asked Sep 08 '09 at 18:37

devoured elysium

278 votes, 3 answers

What is a retpoline and how does it work?

In order to mitigate against kernel or cross-process memory disclosure (the Spectre attack), the Linux kernel will be compiled with a new option, -mindirect-branch=thunk-extern introduced to gcc to perform indirect calls through a so-called…

security assembly x86 cpu-architecture spectre

asked Jan 04 '18 at 05:52

BeeOnRope

263 votes, 5 answers

How does the ARM architecture differ from x86?

Is the x86 Architecture specially designed to work with a keyboard while ARM expects to be mobile? What are the key differences between the two?

x86 arm cpu-architecture

asked Feb 10 '13 at 03:39

user1922878

249 votes, 3 answers

How much of ‘What Every Programmer Should Know About Memory’ is still valid?

I am wondering how much of Ulrich Drepper's What Every Programmer Should Know About Memory from 2007 is still valid. Also I could not find a newer version than 1.0 or an errata. (Also in PDF form on Ulrich Drepper's own site:…

optimization memory x86 cpu-architecture cpu-cache

asked Nov 14 '11 at 18:30

Framester

217 votes, 10 answers

What is the difference between Trap and Interrupt?

What is the difference between Trap and Interrupt? If the terminology is different for different systems, then what do they mean on x86?

x86 operating-system kernel interrupt cpu-architecture

asked Jun 30 '10 at 12:23

David

212 votes, 5 answers

The point of test %eax %eax

Possible Duplicate: x86 Assembly - ‘testl’ eax against eax? I'm very very new to assembly language programming, and I'm currently trying to read the assembly language generated from a binary. I've run across test %eax,%eax or test %rdi,…

assembly x86 att

asked Oct 25 '12 at 08:43

pauliwago
