Is the encoding "66| 48/ 0F 50 D8" in MASM for reg=rbx in "MOVMSKPD reg, xmm" correct?

Question

In MOVMSKPD reg, xmm, VMOVMSKPD reg, xmm2, or VMOVMSKPD reg, ymm2 I think reg is r32 or r64.

But when I tested this in MASM, I got the following results:

MOVMSKPD rbx, xmm0  ;OK, 66| 48/ 0F 50 D8
MOVMSKPD ebx, xmm0  ;OK, 66| 0F 50 D8

I doubt that this result is correct, especially because it contains the "48" prefix. 48h is a REX prefix with the W bit set.
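
As a rough illustration of what those bytes mean, the sketch below splits the REX byte 48h and the ModRM byte D8h from the listing into their bit fields. The helper names `decode_rex` and `decode_modrm` are hypothetical and exist only for this example; the field layouts (REX = 0100WRXB, ModRM = mod:reg:r/m) are the standard ones.

```python
def decode_rex(byte):
    """Split a REX prefix byte (bit pattern 0100WRXB) into its four flag bits."""
    if byte & 0xF0 != 0x40:
        raise ValueError(f"{byte:#04x} is not a REX prefix")
    return {
        "W": (byte >> 3) & 1,  # 1 = 64-bit operand size
        "R": (byte >> 2) & 1,  # extends ModRM.reg
        "X": (byte >> 1) & 1,  # extends SIB.index
        "B": byte & 1,         # extends ModRM.r/m or SIB.base
    }


def decode_modrm(byte):
    """Split a ModRM byte into its mod, reg and r/m fields."""
    return {"mod": byte >> 6, "reg": (byte >> 3) & 7, "rm": byte & 7}


print(decode_rex(0x48))    # {'W': 1, 'R': 0, 'X': 0, 'B': 0}
print(decode_modrm(0xD8))  # {'mod': 3, 'reg': 3, 'rm': 0}
```

Here reg = 3 selects rbx/ebx and r/m = 0 selects xmm0, so the two MOVMSKPD encodings differ only in the presence of the REX.W prefix.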

In contrast, the following two instructions are assembled to exactly the same machine code, each with the VEX.W bit equal to zero.

000004F9  C5 F9/ 50 D8          VMOVMSKPD rbx, xmm0
000004FD  C5 F9/ 50 D8          VMOVMSKPD ebx, xmm0

Are all of these instructions encoded correctly?

I used ml64.exe, and the target is x86-64 (64-bit mode).

[test2.asm]

;MOVMSKPD reg, xmm  ;66 0F 50 /r
MOVMSKPD rbx, xmm0  ;OK, with a 48 prefix
MOVMSKPD ebx, xmm0  ;OK, without a 48 prefix
    
;VMOVMSKPD reg, xmm2    ;VEX.128.66.0F.WIG 50 /r
VMOVMSKPD rbx, xmm0 ;OK
VMOVMSKPD ebx, xmm0 ;OK, but the same machine code as above.
    
;VMOVMSKPD reg, ymm2    ;VEX.256.66.0F.WIG 50 /r
VMOVMSKPD rbx, ymm0 ;OK
VMOVMSKPD ebx, ymm0 ;OK, but the same machine code as above.

[test2.lst]

;MOVMSKPD reg, xmm  ;66 0F 50 /r
 000004F0  66| 48/ 0F 50 D8     MOVMSKPD rbx, xmm0
 000004F5  66| 0F 50 D8         MOVMSKPD ebx, xmm0
                    
;VMOVMSKPD reg, xmm2    ;VEX.128.66.0F.WIG 50 /r
 000004F9  C5 F9/ 50 D8         VMOVMSKPD rbx, xmm0
 000004FD  C5 F9/ 50 D8         VMOVMSKPD ebx, xmm0
                    
;VMOVMSKPD reg, ymm2    ;VEX.256.66.0F.WIG 50 /r
 00000501  C5 FD/ 50 D8         VMOVMSKPD rbx, ymm0
 00000505  C5 FD/ 50 D8         VMOVMSKPD ebx, ymm0
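
To make the VEX lines of the listing easier to read, here is a minimal sketch that decodes the two-byte VEX prefix (C5 xx). The function name `decode_vex2` is hypothetical. The two-byte form carries no W field at all, which would explain why the rbx and ebx variants come out byte-identical for an instruction marked WIG.

```python
def decode_vex2(b0, b1):
    """Decode a two-byte VEX prefix: C5 followed by the bit pattern R vvvv L pp.

    R and vvvv are stored inverted in the encoding.
    """
    if b0 != 0xC5:
        raise ValueError("not a two-byte VEX prefix")
    return {
        "R": ((b1 >> 7) & 1) ^ 1,          # un-inverted; extends ModRM.reg
        "vvvv": ((b1 >> 3) & 0xF) ^ 0xF,   # un-inverted; 0 when encoded as 1111b
        "L": (b1 >> 2) & 1,                # 0 = 128-bit, 1 = 256-bit
        "pp": b1 & 3,                      # 1 = implied 66 prefix
    }


print(decode_vex2(0xC5, 0xF9))  # {'R': 0, 'vvvv': 0, 'L': 0, 'pp': 1}
print(decode_vex2(0xC5, 0xFD))  # {'R': 0, 'vvvv': 0, 'L': 1, 'pp': 1}
```

The only difference between the xmm0 and ymm0 encodings is the L bit; nothing in the prefix records whether the source text named rbx or ebx.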

Note: the following is not directly related, but may be relevant.

About the operand-size override prefix (0x66).
In the Intel Manual PDF, I found a sentence that reads "Use of this prefix with MMX, SSE, and/or SSE2 instructions is reserved and may cause unpredictable behavior."

http://gec.di.uminho.pt/Discip/Lesi/AC10203/docs/P4ISAformat.pdf

CHAPTER 2 INSTRUCTION FORMAT
2.2. INSTRUCTION PREFIXES
The operand-size override prefix allows a program to switch between 16- and 32-bit operand sizes.
Either operand size can be the default. This prefix selects the non-default size.
Use of this prefix with MMX, SSE, and/or SSE2 instructions is reserved and may cause unpredictable behavior (see the note below).
NOTE
Some of the SSE and SSE2 instructions have three-byte opcodes. For these three-byte opcodes, the third opcode byte may be F2H, F3H, or 66H.
For example, the SSE2 instruction CVTDQ2PD has the three-byte opcode F3 0F E6.
The third opcode byte of these three-byte opcodes should not be thought of as a prefix, even though it has the same encoding as the operand size prefix (66H) or one of the repeat prefixes (F2H and F3H).
As described above, using the operand size and repeat prefixes with SSE and SSE2 instructions is reserved.
It should also be noted that execution of SSE2 instructions on an Intel processor that does not support SSE2 (CPUID Feature flag register EDX bit 26 is clear) will result in unpredictable code execution.

I think the REX.W bit of the REX prefix resembles the 66 prefix.
So I suspect that the REX.W bit can be used freely only in legacy instructions, and cannot be added at one's own discretion to MMX, SSE, and/or SSE2 instructions.
I think that in an MMX/SSE/AVX instruction, REX.W can be used safely if it appears in the opcode column of the instruction table, but if it is not listed there, it cannot be added at one's own discretion.

In summary, I think:

In a legacy instruction, one can use the REX.W bit almost freely for a GPR, especially to select the size of the destination operand.
In MMX/SSE/SSE2/AVX/AVX2/AVX-512 instructions, one cannot use the REX.W bit freely.

PS1, 2023/07/24, 01:38, JST

AMD Manual: "AMD 64-Bit Technology, 24108C - January 2001"

Zero-Extension of Results.
In 64-bit mode, when performing 32-bit operations with a GPR destination, the processor zero-extends the 32-bit result into the full 64-bit destination.
8-bit and 16-bit operations on GPRs preserve all unwritten upper bits of the destination GPR.
This is consistent with legacy 16-bit and 32-bit semantics for partial-width results.
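
The rule quoted above can be sketched as a small model of a 64-bit register. This is only an illustration under simplifying assumptions: `write_gpr` is a hypothetical helper, and the high-byte registers (AH, BH, CH, DH) are ignored.

```python
MASK64 = (1 << 64) - 1


def write_gpr(old, value, size_bits):
    """Return the full 64-bit register contents after writing a result of size_bits."""
    if size_bits == 64:
        return value & MASK64
    if size_bits == 32:
        # 32-bit results are zero-extended into the full 64-bit register.
        return value & 0xFFFFFFFF
    if size_bits in (8, 16):
        # 8-bit and 16-bit results leave the unwritten upper bits untouched.
        mask = (1 << size_bits) - 1
        return (old & ~mask) | (value & mask)
    raise ValueError("unsupported operand size")


old = 0xFFFFFFFFFFFFFFFF
print(hex(write_gpr(old, 0x3, 32)))  # 0x3
print(hex(write_gpr(old, 0x3, 16)))  # 0xffffffffffff0003
print(hex(write_gpr(old, 0x3, 8)))   # 0xffffffffffffff03
```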

Intel Manual: MOVMSKPD—Extract Packed Double Precision Floating-Point Sign Mask

Operation  

(V)MOVMSKPD (128-bit Versions)  
DEST[0] := SRC[63]  
DEST[1] := SRC[127]  
IF DEST = r32  
    THEN DEST[31:2] := 0;  
    ELSE DEST[63:2] := 0;  
FI  
  
VMOVMSKPD (VEX.256 Encoded Version)  
DEST[0] := SRC[63]  
DEST[1] := SRC[127]  
DEST[2] := SRC[191]  
DEST[3] := SRC[255]  
IF DEST = r32  
    THEN DEST[31:4] := 0;  
    ELSE DEST[63:4] := 0;  
FI

In the above, note the following:

IF DEST = r32
    THEN DEST[31:2] := 0;
    ELSE DEST[63:2] := 0;
FI

According to the AMD manual, in 64-bit mode, if the destination is a 32-bit general-purpose register (GPR), the upper 32 bits of the underlying 64-bit GPR are cleared to zero.
But according to the Intel manual, if reg is 32-bit then DEST[31:2] := 0, and if reg is 64-bit then DEST[63:2] := 0.
I think this is inconsistent if the instruction obeys the AMD general rule and its operand size is selected by the REX.W bit.
If it obeys the AMD general rule, then when the destination is 32-bit the upper 32 bits of the underlying register are also cleared, so that effectively DEST[63:2] := 0. If that is correct, then writing the following in a manual

IF DEST = r32
  THEN DEST[31:2] := 0;
  ELSE DEST[63:2] := 0;
FI

does NOT make sense, because in both cases, DEST[63:2] := 0.
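
The argument can be checked against a small software model. This is a sketch, not a measurement on real hardware: `movmskpd` is a hypothetical function that applies the Operation pseudocode to the explicit destination and then, for a 32-bit destination, applies the general zero-extension rule for 32-bit writes in 64-bit mode.

```python
import struct


def movmskpd(old_reg64, low, high, dest_bits):
    """Model 128-bit MOVMSKPD; return the resulting full 64-bit register value."""
    # Pack the two doubles as an XMM register: `low` in bits 63:0, `high` in bits 127:64.
    xmm = int.from_bytes(struct.pack("<2d", low, high), "little")
    mask = ((xmm >> 63) & 1) | (((xmm >> 127) & 1) << 1)
    if dest_bits == 64:
        return mask  # DEST[63:2] := 0
    if dest_bits == 32:
        # Pseudocode step: only the explicit 32-bit destination is described.
        after_pseudocode = (old_reg64 & ~0xFFFFFFFF) | mask  # DEST[31:2] := 0
        # General rule: a 32-bit write zero-extends into the 64-bit register.
        return after_pseudocode & 0xFFFFFFFF
    raise ValueError("dest_bits must be 32 or 64")


old = 0xFFFFFFFFFFFFFFFF
print(movmskpd(old, -1.0, 2.0, 32))  # 1
print(movmskpd(old, -1.0, 2.0, 64))  # 1
```

Under this model both destination sizes leave the same value in the full register, so the two branches of the pseudocode differ only in how they describe the explicit destination.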

My point is that, assuming everything is correct, if the instruction MOVMSKPD ebx, xmm0 exists in 64-bit mode, then it does not obey the AMD general rule, which is inconsistent.
This is a proof by contradiction (reductio ad absurdum).

PS2, 2023/07/24, 03:05, JST

I think that:

reg is r32 if the CPU is in 32-bit mode (compatibility mode).
reg is r64 if the CPU is in 64-bit mode (long mode).

and that this cannot be controlled by the REX.W bit or the VEX.W bit.

And I think that either the instruction MOVMSKPD ebx, xmm0 does not exist in 64-bit mode, or the Operation pseudocode in the Intel manual is incorrect about the upper 32 bits of the destination register. According to the AMD general rule, if the destination operand is a 32-bit GPR while the CPU is in 64-bit mode, then the upper 32 bits of the underlying 64-bit GPR are cleared to zero, but the Intel manual says that IF DEST = r32 THEN the upper 32 bits of the underlying 64-bit GPR are preserved.

PS3, 2023/07/24, 17:24, JST

PS4, 2023/07/25, 01:22, JST

I found the encoding for MOVMSKPS in Appendix B of the Intel manual, listed under "Special Case Instructions Promoted Using REX.W".

Vol. 2D B-63

INSTRUCTION FORMATS AND ENCODINGS

B.13 SPECIAL ENCODINGS FOR 64-BIT MODE

The following Pentium, P6, MMX, SSE, SSE2, SSE3 instructions are promoted to 64-bit operation in IA-32e mode by using REX.W. However, these entries are special cases that do not follow the general rules (specified in Section B.4).

Table B-34. Special Case Instructions Promoted Using REX.W (Contd.)

PS5, 2023/07/25, 05:49, JST

3.1.1.1 Opcode Column in the Instruction Summary Table (Instructions without VEX Prefix)

REX.W — Indicates the use of a REX prefix that affects operand size or instruction semantics. The ordering of the REX prefix and other optional/mandatory instruction prefixes are discussed Chapter 2. Note that REX prefixes that promote legacy instructions to 64-bit behavior are not listed explicitly in the opcode column.

I think it is important that this limits the omission of promoting REX prefixes from the opcode column to "legacy instructions" only. Therefore, for non-legacy instructions, REX prefixes that promote them to 64-bit behavior are basically listed explicitly in the opcode column.

3.1.1.3 Instruction Column in the Opcode Summary Table

reg — A general-purpose register used for instructions when the width of the register does not matter to the semantics of the operation of the instruction. The register can be r16, r32, or r64.

On the other hand:

r/m8 — A byte operand that is either the contents of a byte general-purpose register (AL, CL, DL, BL, AH, CH, DH, BH, BPL, SPL, DIL, and SIL) or a byte from memory. Byte registers R8B - R15B are available using REX.R in 64-bit mode.

In the case of r/m8, r represents an 8-bit GPR.
In the case of r/m16, r represents a 16-bit GPR.
In the case of r/m32, r represents a 32-bit GPR.
In the case of r/m64, r represents a 64-bit GPR.

I think that in reg/m8, reg represents an r16/r32/r64 GPR, and that the width of the register does not matter to the semantics of the instruction's operation.

So, I think:

In the case of reg/m8, reg represents an r16/r32/r64 GPR.
In the case of reg/m16, reg represents an r16/r32/r64 GPR.
In the case of reg/m32, reg represents an r16/r32/r64 GPR.
In the case of reg/m64, reg represents an r16/r32/r64 GPR.

PS6, 2023/07/25, 06:32, JST

Intel Manual

3-12 Vol. 1

3.4.1.1 General-Purpose Registers in 64-Bit Mode

When in 64-bit mode, operand size determines the number of valid bits in the destination general-purpose register:

64-bit operands generate a 64-bit result in the destination general-purpose register.
32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register.
8-bit and 16-bit operands generate an 8-bit or 16-bit result. The upper 56 bits or 48 bits (respectively) of the destination general-purpose register are not modified by the operation. If the result of an 8-bit or 16-bit operation is intended for 64-bit address calculation, explicitly sign-extend the register to the full 64-bits.

The W bit is ignored for `vmovmskpd` as explained in the instruction chart where it says WIG. It seems like the assembler you tested interprets this as “never set the W bit.” — fuz, Jul 23 '23 at 11:02
Yes, all these codes are encoded correctly. In VEX-encoded instructions, such as `VMOVMSKPD`, bits of REX are encoded differently, see the chapter **2.3.5** in [manual](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-2a-manual.pdf) — vitsoft, Jul 23 '23 at 11:04
Your bit about the `REX.W` prefix being restricted is wrong. It can be used for most instructions. Note that the VEX and EVEX prefixes incorporate the information given by the REX prefix, so a REX prefix before a VEX or EVEX prefix is neither needed nor permitted. — fuz, Jul 23 '23 at 11:05
I think that reg is r32 if the CPU is in 32-bit mode (compatibility mode) and reg is r64 if the CPU is in 64-bit mode (long mode), and that this can't be controlled by the REX.W bit or the VEX.W bit. — YutakaAoki, Jul 23 '23 at 18:14
Explicit zero-extension to 64-bit with a REX.W or VEX.W is possible, but will always behave identically to implicit zero-extension to 64-bit by writing a 32-bit register. ([Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?](https://stackoverflow.com/q/11177137)). So it's always a waste of a REX prefix. I expected NASM to optimize this like it does for `mov rax, 1` into `mov eax, 1` (unless you use `nasm -O0`), but it actually doesn't, it encodes `movmskpd rax, xmm0` as `66 48 0f 50 c0` instead of `66 0f 50 c0 movmskpd eax, xmm0` — Peter Cordes, Jul 23 '23 at 18:59
Intel's manual (https://www.felixcloutier.com/x86/movmskpd) does *not* say the upper bits of `rbx` will be preserved if you write `ebx`. It just doesn't mention them because they aren't part of the explicit destination, so the general rule about writing 32-bit registers implicitly zero-extending applies. It's kind of pointless for the manual to bother documenting any difference between operand-sizes, but before AVX-512 was designed with separate mask registers, perhaps they were thinking about how `vpmovmskb` would extend to 64-byte vectors to produce a 64-bit integer. — Peter Cordes, Jul 23 '23 at 19:04
So basically you should always use a 32-bit register in your asm source code, to stop assemblers from wasting a REX prefix. Unless you want to make the instruction longer for padding. If you actually try it, like `mov rax, -1` / `movmskpd eax, xmm0`, you'll see the upper bits of RAX are zero, not still FF. Your guess makes a testable prediction: you can and should test it to see that it's not right. — Peter Cordes, Jul 23 '23 at 19:05
@YutakaAoki Your understanding of the size of the *reg* operand of this instruction is correct. — fuz, Jul 23 '23 at 19:50