Are you really using SIMD on VGA text-mode video memory in x86-64 mode? This is amusing, but actually plausible in real life, and works as a use-case for some SIMD data manipulation.
However, if you're really reading from video memory then you might be doing uncached loads, which is bad and implies you should redesign your system so you don't have to do that. (See Ross's answer for suggestions)
On USWC video memory, you can get a big speedup from MOVNTDQA. See Intel's article, and a couple of my answers about NT loads: here and especially this one where I explain what the x86 ISA manuals say about NT loads not overriding the memory ordering semantics, so they're not weakly ordered unless you use them on weakly-ordered memory regions.
As you suspected, you won't find copy instructions in SIMD instruction sets; you have to do the data processing yourself in registers between loads and stores. There isn't even a single SSE/AVX instruction that will do this for you. (ARM NEON's unzip instruction does solve the whole problem, though).
You should use SSE2 PACKUSWB to pack two vectors of (signed) int16_t down to one vector of uint8_t. After zeroing the upper byte of each word element, saturating to 0..255 won't modify your data at all.
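If it's easier to see the idea in intrinsics first, here's a minimal, untested C sketch of just that mask-and-pack step, using plain unaligned loads. The function and parameter names are made up, and it assumes the word count is a multiple of 16 so there's no tail to handle; the 0x00FF mask is the same constant the pcmpeqw/psrlw trick in the asm below generates.

#include <emmintrin.h>   // SSE2
#include <stddef.h>
#include <stdint.h>

// Sketch only: copy the low byte of each 16-bit element of src into dst.
// Assumes n_words is a multiple of 16.
static void pack_low_bytes_sse2(uint8_t *dst, const uint16_t *src, size_t n_words)
{
    const __m128i mask = _mm_set1_epi16(0x00FF);      // keep the low byte of each word
    for (size_t i = 0; i < n_words; i += 16) {        // 16 words in, 16 bytes out
        __m128i v0 = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i v1 = _mm_loadu_si128((const __m128i *)(src + i + 8));
        v0 = _mm_and_si128(v0, mask);                 // zero the high bytes so saturation can't change anything
        v1 = _mm_and_si128(v1, mask);
        __m128i packed = _mm_packus_epi16(v0, v1);    // packuswb: 16x uint16 -> 16x uint8
        _mm_storeu_si128((__m128i *)(dst + i), packed);
    }
}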
Here's a real (untested) loop that aligns the source pointer to minimize penalties from crossing cache-line boundaries, and uses some addressing-mode tricks to save instructions in the loop.
Unaligned loads have very little penalty on Nehalem and later, mostly just extra latency when they cross a cache-line boundary. So aligning the source is mostly useful if you want to use NT loads from video memory, or if you'd otherwise read beyond the end of the src at the end of large copies.
We do twice as many loads as stores, so if load/store throughput were the issue, aligned loads (instead of aligned stores) might be optimal. However, there's too much ALU work to saturate cache load/store throughput, so keeping it simple with unaligned loads (like Paul R's loop) should work very well on most CPUs and use-cases.
mov edx, CMD_BUFFER ; or RIP-relative LEA, or hopefully this isn't even static in the first place and this instruction is something else
;; rdi = source ; yes this is "backwards", but if you already have the src pointer in rdi, don't waste instructions
;; rcx = count
;; rdx = dest
pcmpeqw xmm7, xmm7 ; all ones (0xFF repeating)
psrlw xmm7, 8 ; 0x00FF repeating: mask for zeroing the high bytes
;cmp ecx, 16
;jb fallback_loop ; just make CMD_BUFFER big enough that it's ok to copy 16 bytes when you only wanted 1. Assuming the src is also padded at the end so you can read without faulting.
;; First potentially-unaligned 32B of source data
;; After this, we only read 32B chunks of 32B-aligned source that contain at least one valid byte, and thus can't segfault at the end.
movdqu xmm0, [rdi] ; only diff from loop body: addressing mode and unaligned loads
movdqu xmm1, [rdi + 16]
pand xmm0, xmm7
pand xmm1, xmm7
packuswb xmm0, xmm1
movdqu [rdx], xmm0
;; advance pointers just to the next src alignment boundary. src may have different alignment than dst, so we can't just AND both of them
;; We can only use aligned loads for the src if it was at least word-aligned on entry, but that should be safe to assume.
;; There's probably a way to do this in fewer instructions.
mov eax, edi ; save the original src address (low 32 bits are enough for the difference)
add rdi, 32 ; advance 32B
and rdi, -32 ; and round back to an alignment boundary
sub eax, edi ; old - new = -(how far rdi actually advanced)
neg eax ; eax = how far rdi actually advanced
shr eax, 1
add rdx, rax ; advance dst by half that.
;; if rdi was aligned on entry, then it advances by 32 and rdx advances by 16. If it's guaranteed to always be aligned by 32, then simplify the code by removing this peeled unaligned iteration!
;; if not, the first aligned loop iteration will overlap some of the unaligned loads/store, but that's fine.
;; TODO: fold the above calculations into this other loop setup
lea rax, [rdx + rdx]
sub rdi, rax ; source = [rdi + 2*rdx], so we can just increment our dst pointer.
lea rax, [rdx + rcx] ; rax = end pointer. Assumes ecx was already zero-extended to 64-bit
; jmp .loop_entry ; another way to check if we're already done
; Without it, we don't check for loop exit until we've already copied 64B of input to 32B of output.
; If small inputs are common, checking after the first unaligned vectors does make sense, unless leaving it out makes the branch more predictable. (All sizes up to 32B have identical branch-not-taken behaviour).
ALIGN 16
.pack_loop:
; Use SSE4.1 movntdqa if reading from video RAM or other USWC memory region
movdqa xmm0, [rdi + 2*rdx] ; indexed addressing mode is ok: doesn't need to micro-fuse because loads are already a single uop
movdqa xmm1, [rdi + 2*rdx + 16] ; these could optionally be movntdqa loads, since we got any unaligned source data out of the way.
pand xmm0, xmm7
pand xmm1, xmm7
packuswb xmm0, xmm1
movdqa [rdx], xmm0 ; non-indexed addressing mode: can micro-fuse
add rdx, 16
.loop_entry:
cmp rdx, rax
jb .pack_loop ; exactly 8 uops: should run at 1 iteration per 2 clocks
;; copies up to 15 bytes beyond the requested amount, depending on source alignment.
ret
With AVX's non-destructive 3rd operand encoding, the loads could be folded into the PANDs (vpand xmm0, xmm7, [rdi + 2*rdx]). But indexed addressing modes can't micro-fuse on at least some SnB-family CPUs, so you'd probably want to unroll and use add rdi, 32 as well as add rdx, 16 instead of using the trick of addressing the source relative to the destination.
AVX would bring the loop body down to 4 fused-domain uops for the 2xload+and/pack/store, plus loop overhead. With unrolling, we could start to approach Intel Haswell's theoretical max throughput of 2 loads + 1 store per clock (although it can't sustain that; store-address uops will steal p23 cycles instead of using p7 sometimes. Intel's optimization manual provides a real-world sustainable throughput number of something like ~84B loaded and stored per clock (using 32-byte vectors) assuming all L1 cache hits, which is less than the 96B peak throughput.)
You could also use a byte shuffle (SSSE3 PSHUFB) to get the even bytes of a vector packed into the low 64 bits. (Then do a single 64-bit MOVQ store for each 128-bit load, or combine two lower halves with PUNPCKLQDQ). But this sucks, because (per 128-bit vector of source data), it's 2 shuffles + 2 stores, or 3 shuffles + 1 store. You could make merging cheaper by using different shuffle masks, e.g. shuffle the even bytes to the low half of one vector, and the upper half of another vector. Since PSHUFB can also zero any bytes for free, you can combine with a POR (instead of a slightly more expensive PBLENDW or AVX2 VPBLENDD). This is 2 shuffles + 1 boolean + 1 store, still bottlenecking on shuffles.
The PACKUSWB method is 2 boolean ops + 1 shuffle + 1 store (less of a bottleneck because PAND can run on more execution ports; e.g. 3 per clock vs. 1 per clock for shuffles).
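To make the 2-shuffles + POR variant concrete, here's a rough, untested intrinsics sketch (made-up names and the same exact-multiple size assumption as before). The shuffle controls put the even bytes of one source vector in the low half and the even bytes of the other in the high half, with everything else zeroed:

#include <tmmintrin.h>   // SSSE3
#include <stddef.h>
#include <stdint.h>

// Sketch only: 2x pshufb + 1x por + 1 store per two 16-byte source vectors.
static void pack_low_bytes_pshufb(uint8_t *dst, const uint16_t *src, size_t n_words)
{
    // A shuffle index with the high bit set (-1 here) zeroes that destination byte.
    const __m128i lo_ctrl = _mm_setr_epi8(0, 2, 4, 6, 8, 10, 12, 14,
                                          -1, -1, -1, -1, -1, -1, -1, -1);
    const __m128i hi_ctrl = _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1,
                                          0, 2, 4, 6, 8, 10, 12, 14);
    for (size_t i = 0; i < n_words; i += 16) {        // 16 words in, 16 bytes out
        __m128i v0 = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i v1 = _mm_loadu_si128((const __m128i *)(src + i + 8));
        __m128i lo = _mm_shuffle_epi8(v0, lo_ctrl);   // even bytes of v0 -> low half
        __m128i hi = _mm_shuffle_epi8(v1, hi_ctrl);   // even bytes of v1 -> high half
        _mm_storeu_si128((__m128i *)(dst + i), _mm_or_si128(lo, hi));
    }
}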
AVX512BW (available on Skylake-avx512 but not on KNL) provides VPMOVWB ymm1/m256 {k1}{z}, zmm2 (__m256i _mm512_cvtepi16_epi8 (__m512i a)), which packs with truncation instead of saturation. Unlike the SSE pack instructions, it takes only 1 input and produces a narrower result (which can be a memory destination). (vpmovswb and vpmovuswb are similar, and pack with signed or unsigned saturation. All the same size combos as pmovzx are available, e.g. vpmovqb xmm1/m64 {k1}{z}, zmm2, so you don't need multiple steps. The Q and D source sizes are in AVX512F.)
The memory-dest functionality is even exposed with a C/C++ intrinsic, making it possible to conveniently code a masked store in C. (This is a nice change from pmovzx, where it's inconvenient to use intrinsics and get the compiler to emit a pmovzx load.)
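For example, an untested sketch with made-up names (assumes AVX512BW): the main loop is one load + one vpmovwb per 64 source bytes, and the masked memory-destination form handles the tail with no scalar cleanup:

#include <immintrin.h>   // AVX512F + AVX512BW
#include <stddef.h>
#include <stdint.h>

// Sketch only: truncate each 16-bit element of src to 8 bits and store to dst.
static void pack_low_bytes_avx512bw(uint8_t *dst, const uint16_t *src, size_t n_words)
{
    size_t i = 0;
    for (; i + 32 <= n_words; i += 32) {              // 32 words in, 32 bytes out
        __m512i v = _mm512_loadu_si512(src + i);
        // vpmovwb truncates, so no pand is needed to zero the high bytes first.
        _mm256_storeu_si256((__m256i *)(dst + i), _mm512_cvtepi16_epi8(v));
    }
    if (i < n_words) {                                // masked tail: 1..31 words left
        __mmask32 k = (__mmask32)((1u << (n_words - i)) - 1);
        __m512i v = _mm512_maskz_loadu_epi16(k, src + i);
        _mm512_mask_cvtepi16_storeu_epi8(dst + i, k, v);   // vpmovwb to memory, masked
    }
}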
AVX512VBMI (expected in Intel Cannonlake) can turn two inputs into one 512b output with a single VPERMT2B, given a shuffle mask that takes the even bytes from the two input vectors and produces a single result vector.
If VPERMT2B is slower than VPMOVWB, using VPMOVWB for one vector at a time will probably be best. Even if they have the same throughput/latency/uop count, the gain may be so small that it's not worth making another version and detecting AVX512VBMI instead of AVX512BW. (It's unlikely that a CPU could have AVX512VBMI without having AVX512BW, although that is possible.)
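Roughly, with the _mm512_permutex2var_epi8 intrinsic that's (untested sketch, made-up names, main loop only; a masked vpmovwb could handle the tail as above):

#include <immintrin.h>   // AVX512F + AVX512VBMI
#include <stddef.h>
#include <stdint.h>

// Sketch only: one vpermt2b grabs the even bytes of two 64-byte source vectors
// and produces one full 64-byte result.
static void pack_low_bytes_vbmi(uint8_t *dst, const uint16_t *src, size_t n_words)
{
    // idx[j] = 2*j: values 0..62 select even bytes of 'a', 64..126 even bytes of 'b'.
    static const uint8_t even_idx[64] = {
          0,   2,   4,   6,   8,  10,  12,  14,  16,  18,  20,  22,  24,  26,  28,  30,
         32,  34,  36,  38,  40,  42,  44,  46,  48,  50,  52,  54,  56,  58,  60,  62,
         64,  66,  68,  70,  72,  74,  76,  78,  80,  82,  84,  86,  88,  90,  92,  94,
         96,  98, 100, 102, 104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124, 126,
    };
    const __m512i idx = _mm512_loadu_si512(even_idx);
    for (size_t i = 0; i + 64 <= n_words; i += 64) {  // 64 words in, 64 bytes out
        __m512i a = _mm512_loadu_si512(src + i);
        __m512i b = _mm512_loadu_si512(src + i + 32);
        _mm512_storeu_si512(dst + i, _mm512_permutex2var_epi8(a, idx, b));
    }
}

The index vector is loop-invariant, so it only costs one load outside the loop.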