I have found answers explaining that direct memory-to-memory copy is not possible on x86 platforms without the value being stored somewhere in between.

mov rax,[RSI]
mov [RDI],rax

I make heavy use of 64-bit writes to memory using pop, which appears to copy values from memory to memory directly, without any apparent "middle-man".

Where is the value after it has been read, but before it is written?

Peter Cordes
z0rberg's

2 Answers

The temporary location is a buffer somewhere inside the CPU that isn't part of the architectural state.

On a modern x86 like Skylake, pop [mem] decodes as 2 uops, so presumably the first uop is a pop into an internal register, and the 2nd is a store.

We know that modern x86 CPUs do have a few extra logical registers reserved for use by microcode and by multi-uop instructions like this one. They're renamed onto the physical register file the same way that architectural registers are. For example, http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ mentions "some extra architectural registers for internal use". Henry calls them "architectural" registers, but that's potentially confusing terminology: he just means that they are logical registers as opposed to physical ones, as the architectural registers are. These temporary registers aren't (AFAIK) used across instruction boundaries, only within one x86 instruction.

The original 8086 was non-pipelined (except for instruction prefetch), so the internal microcode or logic that implemented pop [mem] presumably just loaded into, and then stored from, some special-purpose buffer. That is like add [mem], reg, but with a different address for the load vs. the store, and without feeding the value through the ALU.

direct memory-to-memory copy is not possible on x86.

You're probably referring to things like the accepted answer on "Why IA32 does not allow memory to memory mov?" That explanation of the reason is unfortunately just plain wrong and very misleading.

It's an instruction encoding limitation that makes mov [mem], [mem] impossible, not a limitation of CPU internals. See "What x86 instructions take two (or more) memory operands?" pop [mem] is one of them, because one of its memory operands is implicit. Even the original 8086 could do this.
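
As a minimal sketch of those implicit-operand cases (NASM syntax, x86-64 Linux; the labels src, dst1 and dst2, the test value, and the build command in the first comment are assumptions made up for illustration), each of push [mem], pop [mem] and movsq below reads one memory location and writes another without any architectural register holding the value:

```nasm
; Assumed build: nasm -felf64 copy.asm && ld -o copy copy.o
default rel
global _start

section .data
src:    dq 0x1122334455667788   ; hypothetical test value

section .bss
dst1:   resq 1
dst2:   resq 1

section .text
_start:
    ; One explicit memory operand each; the other one is the
    ; implicit stack slot addressed through rsp.
    push    qword [src]         ; [src]  -> stack
    pop     qword [dst1]        ; stack  -> [dst1]

    ; movsq has both memory operands implicit: [rsi] -> [rdi].
    lea     rsi, [src]
    lea     rdi, [dst2]
    movsq

    ; Exit status 0 if both destinations now equal the source.
    mov     rax, [src]
    mov     rdx, [dst1]
    xor     rdx, rax            ; 0 if dst1 == src
    mov     rcx, [dst2]
    xor     rcx, rax            ; 0 if dst2 == src
    xor     edi, edi
    or      rdx, rcx            ; ZF set only if both matched
    setnz   dil
    mov     eax, 60             ; Linux exit(edi)
    syscall
```

Running it and checking the exit status (echo $? in a shell) should give 0 when both copies match.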


I make heavy use of 64-bit writes to memory using pop

If front-end uop throughput or port 2/3 pressure is a bottleneck, consider using 128-bit SSE loads from the stack, then storing the 64-bit halves with movlps and movhps. On current Intel CPUs (like Skylake), movhps [mem], xmm0 is a single-uop instruction. (Strictly, it is one micro-fused uop, since every store is a store-address plus a store-data uop. Either way, it doesn't need a port 5 shuffle uop, unlike the useless memory-destination form of pextrq.)

Or if your destinations are contiguous, do 128-bit or 256-bit copies.
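
A rough sketch of the load-once, store-halves idea, assuming two qwords on the stack are headed for two unrelated addresses (NASM syntax, x86-64 Linux; dst_a, dst_b, expect and the test values are hypothetical names and data, not from the question):

```nasm
; Assumed build: nasm -felf64 halves.asm && ld -o halves halves.o
default rel
global _start

section .data
expect: dq 0x1111111111111111, 0x2222222222222222

section .bss
dst_a:  resq 1
dst_b:  resq 1

section .text
_start:
    ; Set up two qwords on the stack, standing in for data that
    ; would otherwise be popped one at a time.
    mov     rax, 0x2222222222222222
    push    rax
    mov     rax, 0x1111111111111111
    push    rax                 ; [rsp] = 0x1111..., [rsp+8] = 0x2222...

    movups  xmm0, [rsp]         ; one 128-bit load covers both qwords
    add     rsp, 16             ; discard them, as two pops would
    movlps  [dst_a], xmm0       ; store the low 64 bits
    movhps  [dst_b], xmm0       ; store the high 64 bits

    ; Exit status 0 if both halves landed where expected.
    mov     rax, [dst_a]
    xor     rax, [expect]
    mov     rdx, [dst_b]
    xor     rdx, [expect + 8]
    xor     edi, edi
    or      rax, rdx            ; ZF set only if both matched
    setnz   dil
    mov     eax, 60             ; Linux exit(edi)
    syscall
```

movups is used for the load so the sketch doesn't depend on rsp being 16-byte aligned at that point.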

There are use-cases for pop [mem], but it's not wonderful, and it's typically not faster on mainstream Intel than pop reg / mov [mem], reg because it's still 2 uops. It does save code size, and it doesn't need a temporary register, though.
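
To make that trade-off concrete, here is a small sketch of the two forms side by side, with their machine-code bytes in the comments (NASM syntax, x86-64 Linux; out1, out2 and the value 42 are made up for illustration):

```nasm
; Assumed build: nasm -felf64 popmem.asm && ld -o popmem popmem.o
default rel
global _start

section .bss
out1:   resq 1
out2:   resq 1

section .text
_start:
    lea     rdi, [out1]
    lea     rsi, [out2]
    push    42
    push    42                  ; two stack slots to consume

    ; Form 1: memory-destination pop, no scratch register needed.
    pop     qword [rdi]         ; 8F 07     (2 bytes)

    ; Form 2: pop into a register, then store it.
    pop     rax                 ; 58        (1 byte), clobbers rax
    mov     [rsi], rax          ; 48 89 06  (3 bytes)

    ; Exit status 0 if both destinations hold 42.
    mov     rdx, [out1]
    xor     rdx, 42
    mov     rcx, [out2]
    xor     rcx, 42
    xor     edi, edi
    or      rdx, rcx            ; ZF set only if both matched
    setnz   dil
    mov     eax, 60             ; Linux exit(edi)
    syscall
```

With these simple addressing modes, the single-instruction form is 2 bytes against 4 for the pair; the difference changes with the addressing mode and registers involved.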

See https://agner.org/optimize/

Peter Cordes

pop [rax] is one of the ways to do a memory-to-memory operation. The value popped is probably stored somewhere in between, but that's an implementation detail. What these answers mean is that instructions that encode their operands with a ModR/M byte can have at most one memory operand. That covers most instructions, but not instructions like movsb [rdi], [rsi], whose operands are built into the instruction.

fuz
  • Wow, I forgot about movsb/etc. I'd say, if you remove the "probably", then I can accept it as an answer. This "probably" bothers me a lot, because it *must* be stored somewhere behind some implementation detail the public knows nothing about, correct? I apologize for what likely comes across as nitpicking or "semantics". – z0rberg's Sep 18 '19 at 22:00
  • @z0rberg's It is possible that someone has implemented x86 in such a way that the datum is directly transferred from one memory location to another (e.g. by connecting the data busses of two memories), but this is unlikely. As there is no specification mandating an intermediate register, I can't say more than “probably.” I could also say “generally;” would that satisfy you? – fuz Sep 18 '19 at 22:02
  • Your response made me chuckle. "connecting the data busses of two memories." It's fine. Thank you for taking the time to write a response. I will accept it as an answer. Too bad there apparently is not more to learn about this. – z0rberg's Sep 18 '19 at 22:04
  • @z0rberg's Quite on the contrary, there is! Let me see if I can find some further reading that might help you dive deeper into this. – fuz Sep 18 '19 at 22:05
  • @z0rberg's Sorry, I didn't find anything on exactly what goes on in a real x86 processor. I'll get back to you if I find anything specific. – fuz Sep 18 '19 at 22:29
  • Don't worry about it. :) Thank you for your efforts! :D – z0rberg's Sep 18 '19 at 22:39