
What is the fastest way to set a single memory cell to zero in x86? Typically the way I do it is this:

C745D800000000  MOV [ebp-28], 0

As you can see this has a pretty chunky encoding since it is using all 4 bytes for the constant. With a plain register I can use MVZE which is more compact, but MVZE does not work with memory.

I was thinking I could clear a register and then MOV the register's value to memory. That would be two instructions, but only 5 bytes total instead of the single 7-byte instruction above. Following the rule "if it's shorter, it's usually faster", this might be preferable.
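
For reference, here is a sketch of the two encodings being compared, in NASM syntax. The byte values are hand-assembled for 32-bit mode and assume `eax` is free to clobber at this point. Note that the `D8` displacement byte in the listing above is -0x28, so the offset is written in hexadecimal here.

```nasm
bits 32

; One instruction, 7 bytes: opcode C7 /0, ModRM 45, disp8 D8, imm32 00000000
mov     dword [ebp-0x28], 0     ; C7 45 D8 00 00 00 00

; Two instructions, 5 bytes total (assumption: EAX is dead here)
xor     eax, eax                ; 31 C0      zeroing idiom, also rewrites EFLAGS
mov     [ebp-0x28], eax         ; 89 45 D8   store the zeroed register
```

The shorter sequence costs an extra instruction and a scratch register, so whether it is actually faster depends on the surrounding code and the microarchitecture; it is worth measuring rather than assuming.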

Peter Cordes
Tyler Durden
  • You can XOR it with itself, but I don't think that would be faster: `xor [ebp-28], [ebp-28]`. – Linuxios Mar 15 '13 at 22:24
  • You can't XOR a memory cell with itself, since an instruction cannot have two memory operands. – Daniel Kamil Kozar Mar 15 '13 at 22:24
  • Some x86 instructions have two memory operands.... – Carl Norum Mar 15 '13 at 22:28
  • @CarlNorum : would you care to give an example? – Daniel Kamil Kozar Mar 15 '13 at 23:21
  • @Daniel, `movs` has both source & destination memory pointers. – Carl Norum Mar 16 '13 at 00:00
  • Sure, but they're not explicitly stated as the operands to this instruction and encoded directly with it. – Daniel Kamil Kozar Mar 16 '13 at 00:25
  • FWIW `push` is another mem->mem instruction. – Igor Skochinsky Mar 18 '13 at 12:05
  • @DanielKamilKozar [What x86 instructions take two (or more) memory operands?](https://stackoverflow.com/q/52573554/995714). Yes, an instruction can have at most one explicit memory operand – phuclv Feb 01 '19 at 15:56
  • @Linuxios [Why isn't movl from memory to memory allowed?](https://stackoverflow.com/q/33794169/995714), [Why can't one instruction include two memory references in assembly?](https://stackoverflow.com/q/17514527/995714) – phuclv Feb 01 '19 at 15:58
  • When you say "a single cell", do you mean a byte? Or do you mean a dword / qword (where `mov` would require an `imm32`)? – Peter Cordes Feb 02 '19 at 09:01
  • @PeterCordes I mean a word of memory, which on a 32-bit machine would be 4 bytes, or on a 64-bit machine would be 8 bytes. But I am open to answers that would zero only a single byte. In general, since CPUs are architected around manipulating words, then that is the anticipated subject of the question. – Tyler Durden Feb 02 '19 at 21:46
  • You tagged this x86; they're architected around unaligned loads/stores and single bytes. Apparently many non-x86 CPUs actually do a RMW cycle in cache to update a byte within a word for a byte store ([Are there any modern/ancient CPUs / microcontrollers where a cached byte store is actually slower than a word store?](//stackoverflow.com/q/54217528)), but modern ISAs are all byte-addressable and all have architectural byte stores. ([Can modern x86 hardware not store a single byte to memory?](//stackoverflow.com/q/46721075)). (except early Alpha, if you consider it modern). – Peter Cordes Feb 02 '19 at 21:52
  • @PeterCordes Okay, then consider the question to be zeroing a byte of memory. Throw me a bone here, I haven't gotten too much intelligent response on this question of any type. You are beating a dead horse, and there isn't much of a horse to beat. – Tyler Durden Feb 02 '19 at 22:13
  • A couple of strategies I use when reversing/patching binaries where I need to keep the same number of bytes to achieve moving 0 into a memory address instead of what's originally referenced: 1) If you know that the state of a particular register will be 0 at that instruction every time you execute your application, then you could just `mov` from said empty register to your memory address, thus sparing the step of clearing a register. 2) You could `pop [ebp-28]` if you know the top of the stack reliably contains 0s and you don't have to worry about the incremented stack pointer causing a crash. – dsasmblr Aug 19 '23 at 17:11

2 Answers

Unfortunately, what you have written here is the only way to "directly" zero out a memory cell. Of course, XORing out a register and then moving it to some memory location would also work, but I don't know if that would be any faster.

If you happen to have a register whose value is zero and you're sure of it, then by all means use it. Otherwise, just stick with `mov [ebp-28], 0`. Keep in mind that instructions with `mem, imm` operands are known to be among the slowest forms: if you profile your code and find that this is a bottleneck, try initializing a register to zero at the beginning of your function (or wherever is convenient) and then using it throughout the code as a sort of predefined constant.
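
A minimal sketch of that zero-register approach in NASM syntax, assuming 32-bit code with an `ebp` frame. The choice of `ebx` and the three stack offsets are hypothetical, picked only for illustration.

```nasm
bits 32

; Hypothetical function body: EBX is reserved as a zero constant.
; EBX is callee-saved in the common 32-bit calling conventions, so save it first.
push    ebx
xor     ebx, ebx                ; EBX = 0 for the rest of the function

mov     [ebp-0x28], ebx         ; 89 5D D8: 3 bytes per store instead of 7
mov     [ebp-0x24], ebx
mov     [ebp-0x20], ebx

pop     ebx                     ; restore the caller's EBX
```

The saving only pays off when several stores share the zeroed register; for a single store, the save and restore of `ebx` would outweigh it.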

Daniel Kamil Kozar
  • Do you know if this is also the shortest way? On x86 64 `mov [r14], 0` is a 7 byte instruction. – Björn Lindqvist Mar 05 '15 at 13:01
  • @Björn On x86-64, `xor eax, eax` + `mov [r14], rax` would be only 5 bytes. (You don't need to XOR the 64-bit register `rax` because all operations on 32-bit registers implicitly clear the upper half, and they are shorter to encode.) This may not necessarily be *faster*, though, than a `mov mem, imm`. But like Daniel says, it'd be an obvious, massive win if you had any other use for the value 0 in that same function, especially since on x86-64, you virtually always have registers to spare. The decision is a bit harder on x86-32, where you'd be giving up a valuable register as a zero-register. – Cody Gray - on strike Dec 16 '16 at 15:21
  • Fun fact: Intel CPUs can't micro-fuse an instruction with a RIP-relative addressing mode and an immediate, so `mov dword [rel label], 0` decodes as a 2-uop instruction. So for static data on x86-64, it's pure win to `xor`-zero a register first if you're tuning for Intel CPUs. – Peter Cordes Feb 02 '19 at 08:58

If you expect your data to be out of the cache, and you don't expect to access it again soon, MASKMOVDQU might be the fastest way. This allows you to write one or more bytes without affecting surrounding bytes and without waiting for a request-for-ownership (RFO) to bring the associated cache line into the cache.

Essentially, the write is sent directly down to memory, rather than the other way around. Since the CPU interacts with memory in cache-line sized chunks, what is happening under the covers is that the cache line containing the write is sent down, along with a mask indicating which bytes are actually to be updated. Either at the memory controller, the L3 cache, or in the memory itself, the bytes to be written are then merged with the bytes that should be left alone.
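
As a rough sketch of what that could look like for a 4-byte cell, in NASM syntax: this assumes SSE2 is available and that `edi`, `xmm0` and `xmm1` are free to clobber. The `[ebp-0x28]` address is reused from the question purely for continuity; a stack slot is normally hot in cache, so it is not the out-of-cache case this technique is meant for.

```nasm
bits 32

pxor        xmm0, xmm0          ; XMM0 = data to store (all zero)
pcmpeqd     xmm1, xmm1          ; XMM1 = all ones
psrldq      xmm1, 12            ; shift right 12 bytes: only bytes 0..3 stay 0xFF
lea         edi, [ebp-0x28]     ; MASKMOVDQU's destination is implicit: DS:EDI
maskmovdqu  xmm0, xmm1          ; write XMM0 bytes whose mask byte has its high bit set
sfence                          ; only needed if later stores must be ordered after this one
```

MASKMOVDQU is a non-temporal, weakly ordered store, and its cost varies a lot between CPU generations, so treat this as something to benchmark on the target hardware rather than a guaranteed win.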

BeeOnRope