Are you really using SIMD on VGA text-mode video memory in x86-64 mode? This is amusing, but actually plausible in real life, and works as a use-case for some SIMD data manipulation.
However, if you're really reading from video memory then you might be doing uncached loads, which is bad and implies you should redesign your system so you don't have to do that. (See Ross's answer for suggestions)
On USWC video memory, you can get a big speedup from MOVNTDQA. See Intel's article, and a couple of my answers about NT loads: here and especially this one where I explain what the x86 ISA manuals say about NT loads not overriding the memory ordering semantics, so they're not weakly ordered unless you use them on weakly-ordered memory regions.
As you suspected, you won't find copy instructions in SIMD instruction sets; you have to do the data processing yourself in registers between loads and stores. There isn't even a single SSE/AVX instruction that will do this for you. (ARM NEON's unzip instruction does solve the whole problem, though).
You should use SSE2 PACKUSWB to pack two vectors of (signed) int16_t down to one vector of uint8_t. After zeroing the upper byte of each word element, saturating to 0..255 won't modify your data at all.
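If it's easier to see the idea in intrinsics first, here's a minimal, untested C sketch of just that mask-and-pack step, using plain unaligned loads. The function and parameter names are made up, and it assumes the word count is a multiple of 16 so there's no tail to handle; the 0x00FF mask is the same constant the pcmpeqw/psrlw trick in the asm below generates.

#include <emmintrin.h>   // SSE2
#include <stddef.h>
#include <stdint.h>

// Sketch only: copy the low byte of each 16-bit element of src into dst.
// Assumes n_words is a multiple of 16.
static void pack_low_bytes_sse2(uint8_t *dst, const uint16_t *src, size_t n_words)
{
    const __m128i mask = _mm_set1_epi16(0x00FF);      // keep the low byte of each word
    for (size_t i = 0; i < n_words; i += 16) {        // 16 words in, 16 bytes out
        __m128i v0 = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i v1 = _mm_loadu_si128((const __m128i *)(src + i + 8));
        v0 = _mm_and_si128(v0, mask);                 // zero the high bytes so saturation can't change anything
        v1 = _mm_and_si128(v1, mask);
        __m128i packed = _mm_packus_epi16(v0, v1);    // packuswb: 16x uint16 -> 16x uint8
        _mm_storeu_si128((__m128i *)(dst + i), packed);
    }
}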
Here's a real (untested) loop that aligns the source pointer to minimize penalties from crossing cache-line boundaries, and uses some addressing-mode tricks to save instructions in the loop.
Unaligned loads have very little penalty on Nehalem and later, mostly just extra latency when they cross a cache-line boundary. So aligning the source is mostly useful if you want to use NT loads from video memory, or if you'd otherwise read beyond the end of the src at the end of large copies.
We do twice as many loads as stores, so if load/store throughput were the issue, aligned loads (instead of aligned stores) might be optimal. However, there's too much ALU work to saturate cache load/store throughput, so keeping it simple with unaligned loads (like Paul R's loop) should work very well on most CPUs and use-cases.
mov edx, CMD_BUFFER ; or RIP-relative LEA, or hopefully this isn't even static in the first place and this instruction is something else
;; rdi = source ; yes this is "backwards", but if you already have the src pointer in rdi, don't waste instructions
;; rcx = count
;; rdx = dest
pcmpeqw xmm7, xmm7 ; all ones (0xFF repeating)
psrlw xmm7, 8 ; 0x00FF repeating: mask for zeroing the high bytes
;cmp ecx, 16
;jb fallback_loop ; just make CMD_BUFFER big enough that it's ok to copy 16 bytes when you only wanted 1. Assuming the src is also padded at the end so you can read without faulting.
;; First potentially-unaligned 32B of source data
;; After this, we only read 32B chunks of 32B-aligned source that contain at least one valid byte, and thus can't segfault at the end.
movdqu xmm0, [rdi] ; only diff from loop body: addressing mode and unaligned loads
movdqu xmm1, [rdi + 16]
pand xmm0, xmm7
pand xmm1, xmm7
packuswb xmm0, xmm1
movdqu [rdx], xmm0
;; advance pointers just to the next src alignment boundary. src may have different alignment than dst, so we can't just AND both of them
;; We can only use aligned loads for the src if it was at least word-aligned on entry, but that should be safe to assume.
;; There's probably a way to do this in fewer instructions.
mov eax, edi ; save the original src address (low 32 bits are enough for the difference)
add rdi, 32 ; advance 32B
and rdi, -32 ; and round back to an alignment boundary
sub eax, edi ; old - new = -(how far rdi actually advanced)
neg eax ; eax = how far rdi actually advanced
shr eax, 1
add rdx, rax ; advance dst by half that.
;; if rdi was aligned on entry, then it advances by 32 and rdx advances by 16. If it's guaranteed to always be aligned by 32, then simplify the code by removing this peeled unaligned iteration!
;; if not, the first aligned loop iteration will overlap some of the unaligned loads/store, but that's fine.
;; TODO: fold the above calculations into this other loop setup
lea rax, [rdx + rdx]
sub rdi, rax ; source = [rdi + 2*rdx], so we can just increment our dst pointer.
lea rax, [rdx + rcx] ; rax = end pointer. Assumes ecx was already zero-extended to 64-bit
; jmp .loop_entry ; another way to check if we're already done
; Without it, we don't check for loop exit until we've already copied 64B of input to 32B of output.
; If small inputs are common, checking after the first unaligned vectors does make sense, unless leaving it out makes the branch more predictable. (All sizes up to 32B have identical branch-not-taken behaviour).
ALIGN 16
.pack_loop:
; Use SSE4.1 movntdqa if reading from video RAM or other USWC memory region
movdqa xmm0, [rdi + 2*rdx] ; indexed addressing mode is ok: doesn't need to micro-fuse because loads are already a single uop
movdqa xmm1, [rdi + 2*rdx + 16] ; these could optionally be movntdqa loads, since we got any unaligned source data out of the way.
pand xmm0, xmm7
pand xmm1, xmm7
packuswb xmm0, xmm1
movdqa [rdx], xmm0 ; non-indexed addressing mode: can micro-fuse
add rdx, 16
.loop_entry:
cmp rdx, rax
jb .pack_loop ; exactly 8 uops: should run at 1 iteration per 2 clocks
;; copies up to 15 bytes beyond the requested amount, depending on source alignment.
ret
With AVX's non-destructive 3rd operand encoding, the loads could be folded into the PANDs (vpand xmm0, xmm7, [rdi + 2*rdx]). But indexed addressing modes can't micro-fuse on at least some SnB-family CPUs, so you'd probably want to unroll and use add rdi, 32 as well as add rdx, 16 instead of using the trick of addressing the source relative to the destination.
AVX would bring the loop body down to 4 fused-domain uops for the 2xload+and/pack/store, plus loop overhead. With unrolling, we could start to approach Intel Haswell's theoretical max throughput of 2 loads + 1 store per clock (although it can't sustain that; store-address uops will steal p23 cycles instead of using p7 sometimes. Intel's optimization manual provides a real-world sustainable throughput number of something like ~84B loaded and stored per clock (using 32-byte vectors) assuming all L1 cache hits, which is less than the 96B peak throughput.)
You could also use a byte shuffle (SSSE3 PSHUFB) to get the even bytes of a vector packed into the low 64 bits. (Then do a single 64-bit MOVQ store for each 128-bit load, or combine two lower halves with PUNPCKLQDQ). But this sucks, because (per 128-bit vector of source data), it's 2 shuffles + 2 stores, or 3 shuffles + 1 store. You could make merging cheaper by using different shuffle masks, e.g. shuffle the even bytes to the low half of one vector, and the upper half of another vector. Since PSHUFB can also zero any bytes for free, you can combine with a POR (instead of a slightly more expensive PBLENDW or AVX2 VPBLENDD). This is 2 shuffles + 1 boolean + 1 store, still bottlenecking on shuffles.
The PACKUSWB method is 2 boolean ops + 1 shuffle + 1 store (less of a bottleneck because PAND can run on more execution ports; e.g. 3 per clock vs. 1 per clock for shuffles).
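To make the 2-shuffles + POR variant concrete, here's a rough, untested intrinsics sketch (made-up names and the same exact-multiple size assumption as before). The shuffle controls put the even bytes of one source vector in the low half and the even bytes of the other in the high half, with everything else zeroed:

#include <tmmintrin.h>   // SSSE3
#include <stddef.h>
#include <stdint.h>

// Sketch only: 2x pshufb + 1x por + 1 store per two 16-byte source vectors.
static void pack_low_bytes_pshufb(uint8_t *dst, const uint16_t *src, size_t n_words)
{
    // A shuffle index with the high bit set (-1 here) zeroes that destination byte.
    const __m128i lo_ctrl = _mm_setr_epi8(0, 2, 4, 6, 8, 10, 12, 14,
                                          -1, -1, -1, -1, -1, -1, -1, -1);
    const __m128i hi_ctrl = _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1,
                                          0, 2, 4, 6, 8, 10, 12, 14);
    for (size_t i = 0; i < n_words; i += 16) {        // 16 words in, 16 bytes out
        __m128i v0 = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i v1 = _mm_loadu_si128((const __m128i *)(src + i + 8));
        __m128i lo = _mm_shuffle_epi8(v0, lo_ctrl);   // even bytes of v0 -> low half
        __m128i hi = _mm_shuffle_epi8(v1, hi_ctrl);   // even bytes of v1 -> high half
        _mm_storeu_si128((__m128i *)(dst + i), _mm_or_si128(lo, hi));
    }
}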
AVX512BW (available on Skylake-avx512 but not on KNL) provides VPMOVWB ymm1/m256 {k1}{z}, zmm2 (__m256i _mm512_cvtepi16_epi8 (__m512i a)), which packs with truncation instead of saturation. Unlike the SSE pack instructions, it takes only 1 input and produces a narrower result (which can be a memory destination). (vpmovswb and vpmovuswb are similar, and pack with signed or unsigned saturation. All the same size combos as pmovzx are available, e.g. vpmovqb xmm1/m64 {k1}{z}, zmm2, so you don't need multiple steps. The Q and D source sizes are in AVX512F.)
The memory-dest functionality is even exposed with a C/C++ intrinsic, making it possible to conveniently code a masked store in C. (This is a nice change from pmovzx, where it's inconvenient to use intrinsics and get the compiler to emit a pmovzx load.)
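For example, an untested sketch with made-up names (assumes AVX512BW): the main loop is one load + one vpmovwb per 64 source bytes, and the masked memory-destination form handles the tail with no scalar cleanup:

#include <immintrin.h>   // AVX512F + AVX512BW
#include <stddef.h>
#include <stdint.h>

// Sketch only: truncate each 16-bit element of src to 8 bits and store to dst.
static void pack_low_bytes_avx512bw(uint8_t *dst, const uint16_t *src, size_t n_words)
{
    size_t i = 0;
    for (; i + 32 <= n_words; i += 32) {              // 32 words in, 32 bytes out
        __m512i v = _mm512_loadu_si512(src + i);
        // vpmovwb truncates, so no pand is needed to zero the high bytes first.
        _mm256_storeu_si256((__m256i *)(dst + i), _mm512_cvtepi16_epi8(v));
    }
    if (i < n_words) {                                // masked tail: 1..31 words left
        __mmask32 k = (__mmask32)((1u << (n_words - i)) - 1);
        __m512i v = _mm512_maskz_loadu_epi16(k, src + i);
        _mm512_mask_cvtepi16_storeu_epi8(dst + i, k, v);   // vpmovwb to memory, masked
    }
}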
AVX512VBMI (expected in Intel Cannonlake) can turn two inputs into one 512b output with a single VPERMT2B, given a shuffle mask that takes the even bytes from the two input vectors and produces a single result vector.
If VPERMT2B is slower than VPMOVWB, using VPMOVWB for one vector at a time will probably be best. Even if they have the same throughput/latency/uop count, the gain may be so small that it's not worth making another version and detecting AVX512VBMI instead of AVX512BW. (It's unlikely that a CPU could have AVX512VBMI without having AVX512BW, although that is possible.)
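Roughly, with the _mm512_permutex2var_epi8 intrinsic that's (untested sketch, made-up names, main loop only; a masked vpmovwb could handle the tail as above):

#include <immintrin.h>   // AVX512F + AVX512VBMI
#include <stddef.h>
#include <stdint.h>

// Sketch only: one vpermt2b grabs the even bytes of two 64-byte source vectors
// and produces one full 64-byte result.
static void pack_low_bytes_vbmi(uint8_t *dst, const uint16_t *src, size_t n_words)
{
    // idx[j] = 2*j: values 0..62 select even bytes of 'a', 64..126 even bytes of 'b'.
    static const uint8_t even_idx[64] = {
          0,   2,   4,   6,   8,  10,  12,  14,  16,  18,  20,  22,  24,  26,  28,  30,
         32,  34,  36,  38,  40,  42,  44,  46,  48,  50,  52,  54,  56,  58,  60,  62,
         64,  66,  68,  70,  72,  74,  76,  78,  80,  82,  84,  86,  88,  90,  92,  94,
         96,  98, 100, 102, 104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124, 126,
    };
    const __m512i idx = _mm512_loadu_si512(even_idx);
    for (size_t i = 0; i + 64 <= n_words; i += 64) {  // 64 words in, 64 bytes out
        __m512i a = _mm512_loadu_si512(src + i);
        __m512i b = _mm512_loadu_si512(src + i + 32);
        _mm512_storeu_si512(dst + i, _mm512_permutex2var_epi8(a, idx, b));
    }
}

The index vector is loop-invariant, so it only costs one load outside the loop.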