The temporary location is a buffer somewhere inside the CPU that isn't part of the architectural state.
On a modern x86 like Skylake, pop [mem] decodes as 2 uops, so presumably the first uop is a pop into an internal register, and the 2nd is a store.
We know that modern x86 CPUs do have a few extra logical registers reserved for use by microcode and multi-uop instructions like this. They're renamed onto the physical register file the same way that architectural registers are. e.g. http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ mentions "some extra architectural registers for internal use". Henry calls them "architectural" registers but that's potentially confusing terminology. He just means logical as opposed to physical, like the architectural registers. These temporary registers aren't (AFAIK) used across instruction boundaries, only within one x86 instruction.
Original 8086 was non-pipelined (except for instruction prefetch) so the internal microcode or logic that implemented pop [mem] presumably just loaded and then stored from some special purpose buffer. Like add [mem], reg but with a different address for the load vs. store and without feeding it through the ALU.
direct memory-to-memory copy is not possible on x86.
You're probably referring to things like the accepted answer on Why IA32 does not allow memory to memory mov? That explanation of the reason is unfortunately just plain wrong and very misleading.
It's an instruction encoding limitation that makes mov [mem], [mem] impossible, not a limitation of CPU internals. See What x86 instructions take two (or more) memory operands?
pop [mem] is one of them because one of the memory operands is implicit. Even original 8086 could do this.
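To make the encoding point concrete, here is a small sketch in NASM syntax for x86-64. The register choices (rsi as a source pointer, rdi as a destination pointer) are assumptions for illustration only, and the three examples are independent of each other.

```nasm
; Assumption: x86-64, NASM syntax; rsi = source pointer, rdi = destination pointer.

; mov   qword [rdi], [rsi]  ; not encodable: mov has no form with two explicit memory operands

; These work because one memory operand (the stack) is implicit:
push    qword [rsi]         ; load 8 bytes from [rsi], store them to the stack
pop     qword [rdi]         ; load 8 bytes from the stack, store them to [rdi]

; Here both memory operands are implicit:
movsq                       ; copy 8 bytes from [rsi] to [rdi], then advance rsi and rdi
                            ; by 8 (up or down, depending on the direction flag)
```

The push/pop pair leaves rsp where it started, so together they act as a memory-to-memory copy that needs no scratch register.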
I make heavy use of 64bit writes to memory using pop
If front-end uop throughput or port 2/3 pressure is a bottleneck, consider using 128-bit SSE loads from the stack, then store 64-bit halves with movlps and movhps. On current Intel CPUs (like Skylake), movhps [mem], xmm0 is a single-uop instruction. (Actually micro-fused; all stores are store-address + store-data. But anyway, no port 5 shuffle uop needed like for the useless memory-destination form of pextrq.)
Or if your destinations are contiguous, do 128-bit or 256-bit copies.
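As a rough sketch of the SSE suggestion, assuming x86-64 and NASM syntax, with rdi and rsi as hypothetical destination pointers: the first block stands in for two back-to-back pop qword [mem] instructions, and the second covers the contiguous-destination case. Whether either is actually faster in a given loop is something to measure, not assume.

```nasm
; Assumption: x86-64, NASM syntax; rdi and rsi are hypothetical destination pointers.

; Scattered destinations. Replaces:  pop qword [rdi]  /  pop qword [rsi]
movups  xmm0, [rsp]         ; one 16-byte load covers both stack slots
movlps  [rdi], xmm0         ; low 64 bits: the value the first pop would have stored
movhps  [rsi], xmm0         ; high 64 bits: the value the second pop would have stored
add     rsp, 16             ; release both slots, as the two pops would have

; Contiguous destinations (16 bytes at [rdi]): one load, one store.
movups  xmm0, [rsp]
movups  [rdi], xmm0
add     rsp, 16
```

One difference from pop: add rsp, 16 writes FLAGS, which pop does not. If the surrounding code needs FLAGS preserved, lea rsp, [rsp+16] adjusts the stack pointer without touching them.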
There are use-cases for pop [mem] but it's not wonderful, and typically not faster on mainstream Intel than pop reg / mov [mem], reg because it's still 2 uops. It does save code size, and doesn't need a tmp reg, though.
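For reference, the two alternatives being compared look like this in NASM syntax on x86-64. The destination pointer rdi and the scratch register rax are arbitrary choices for the example; use one option or the other, not both.

```nasm
; Assumption: x86-64, NASM syntax; rdi = destination pointer, rax = free scratch register.

; Option A: memory-destination pop. One instruction, no scratch register.
pop     qword [rdi]

; Option B: two instructions, clobbers rax.
pop     rax
mov     [rdi], rax
```

Option B leaves the popped value in rax as well, which can be useful if the code needs it again right away.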
See https://agner.org/optimize/