In an xmm register I have 3 integers with values less than 256. I want to cast these to bytes and save them to memory, but I don't know how to approach it.
I was thinking about getting those numbers from `xmm1` and saving them to `eax`, then moving the lowest bytes to memory, but I am not sure how to get integers out of an xmm register. I can only get the element at the 0th position, so how do I move the rest?
There is an instruction that would be perfect for me, `VPMOVDB`, but I can't use it on my processor. Is there an alternative to it?

thomas113412
- Is it okay if one additional byte past the end is written? What instruction set extensions are you permitted to use? – fuz Jan 15 '22 at 12:22
- I can use up to SSE4.1, and now that I think about it, I can handle one additional byte. – thomas113412 Jan 15 '22 at 12:31
- You seem to have a lot of related questions recently (starting with https://stackoverflow.com/questions/70636962/multiplying-and-adding-float-numbers). I would not say that your questions are bad on their own, but you should really try to optimize the whole problem -- e.g., try to avoid the entire `uint8 -> int32 -> float -> int32 -> uint8` conversion chain. – chtz Jan 16 '22 at 11:32
1 Answer
The easiest way is probably to use `pshufb` to permute the bytes, followed by `movd` to store the datum:
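    ; convert dwords in xmm0 into bytes and store into dest
    pshufb  xmm0, xmmword ptr mask      ; gather the low byte of each dword into bytes 0..3
    movd    dword ptr dest, xmm0        ; store those 4 bytes
    ...
    align 16
    mask    db 0, 4, 8, 12, 12 dup (-1) ; -1 (high bit set) zeroes the remaining bytes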
This stores 4 bytes instead of 3, so make sure your code can handle that. Storing only 3 bytes is also possible, but requires more work:
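    ; convert dwords in xmm0 into bytes and store into dest
    pshufb  xmm0, xmmword ptr mask      ; gather the low byte of each dword into bytes 0..3
    movd    eax, xmm0                   ; eax = b3:b2:b1:b0
    mov     word ptr dest, ax           ; store bytes 0 and 1
    bswap   eax                         ; eax = b0:b1:b2:b3, so ah = b2
    mov     byte ptr dest+2, ah         ; store byte 2
    ...
    align 16
    mask    db 0, 4, 8, 12, 12 dup (-1) ; -1 (high bit set) zeroes the remaining bytes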
If you perform this conversion more than once, you can load the mask into a register ahead of time to avoid the penalty of repeatedly loading it from memory.
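For anyone who prefers to experiment from C, here is a rough intrinsics sketch of the same idea, with the mask loaded once outside a loop. It assumes GCC or Clang (for the `target` attribute) and a caller that can tolerate the one extra byte written per store; the names `make_pack_mask`, `pack_store4` and `pack_many` are made up for this illustration and are not part of the answer above.

```c
#include <stdint.h>
#include <string.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (pshufb) */

/* Shuffle control equivalent to "db 0, 4, 8, 12, 12 dup (-1)":
   pick the low byte of each dword; -1 zeroes the remaining bytes. */
static __attribute__((target("ssse3"))) __m128i make_pack_mask(void)
{
    return _mm_setr_epi8(0, 4, 8, 12, -1, -1, -1, -1,
                         -1, -1, -1, -1, -1, -1, -1, -1);
}

/* pshufb + movd: writes 4 bytes to dest (the 4th is the low byte of dword 3). */
static __attribute__((target("ssse3")))
void pack_store4(uint8_t *dest, __m128i v, __m128i mask)
{
    __m128i packed = _mm_shuffle_epi8(v, mask);
    int32_t low = _mm_cvtsi128_si32(packed);
    memcpy(dest, &low, sizeof low);  /* 4-byte store, no alignment requirement */
}

/* Hypothetical loop: n groups of 4 dwords in src, 3 useful bytes per group.
   dest must have room for 3*n + 1 bytes because each store writes 4. */
static __attribute__((target("ssse3")))
void pack_many(uint8_t *dest, const int32_t *src, size_t n)
{
    const __m128i mask = make_pack_mask();  /* loaded once, reused every iteration */
    for (size_t i = 0; i < n; i++) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + 4 * i));
        pack_store4(dest + 3 * i, v, mask);
    }
}
```

Each iteration overwrites the spare fourth byte of the previous group, so only the very last store leaves an extra byte behind.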

fuz
- Another way to store three bytes would be `movd`/`mov ax` + `pextrb` to memory, since we can use SSE4.1. Reading AH costs extra latency on modern Intel, so instead of bswap you might just `shr eax, 16` and store AL, for only an extra 1 byte of code size. – Peter Cordes Jan 15 '22 at 16:15
- Another way to store just three bytes would be to use [`maskmovdqu`](https://www.felixcloutier.com/x86/maskmovdqu) -- but this does not have a very good throughput on most CPUs. – chtz Jan 16 '22 at 11:36
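The two suggestions in these comments for writing exactly three bytes can be sketched with intrinsics as follows. This is only an illustration under a few assumptions: GCC or Clang (for the `target` attribute), SSE4.1 available at run time, and helper names (`pack3`, `store3_pextrb`, `store3_maskmov`) that are invented here. Note that `maskmovdqu` is a store with a non-temporal hint, which is part of why it tends to be slow.

```c
#include <stdint.h>
#include <string.h>
#include <smmintrin.h>  /* SSE4.1 header; also provides the SSSE3 and SSE2 intrinsics used here */

/* pshufb: low byte of dwords 0..2 goes to bytes 0..2, every other byte is zeroed. */
static __attribute__((target("sse4.1"))) __m128i pack3(__m128i v)
{
    const __m128i shuf = _mm_setr_epi8(0, 4, 8, -1, -1, -1, -1, -1,
                                       -1, -1, -1, -1, -1, -1, -1, -1);
    return _mm_shuffle_epi8(v, shuf);
}

/* movd + 16-bit store for bytes 0 and 1, then pextrb for byte 2. */
static __attribute__((target("sse4.1")))
void store3_pextrb(uint8_t *dest, __m128i v)
{
    __m128i packed = pack3(v);
    uint16_t low2 = (uint16_t)_mm_cvtsi128_si32(packed);
    memcpy(dest, &low2, sizeof low2);
    dest[2] = (uint8_t)_mm_extract_epi8(packed, 2);
}

/* maskmovdqu: only bytes whose selector byte has its high bit set are written. */
static __attribute__((target("sse4.1")))
void store3_maskmov(uint8_t *dest, __m128i v)
{
    const __m128i sel = _mm_setr_epi8(-1, -1, -1, 0, 0, 0, 0, 0,
                                      0, 0, 0, 0, 0, 0, 0, 0);
    _mm_maskmoveu_si128(pack3(v), sel, (char *)dest);
}
```

Both functions leave `dest[3]` untouched, unlike the 4-byte `movd` store.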