
I'm writing a device driver in Linux for a PCIe device. The driver performs several reads and writes to test the throughput. When I use memcpy, the maximum payload of a TLP is 8 bytes (on 64-bit architectures). In my opinion the only way to get a payload of 16 bytes is to use the SSE instruction set. I've already seen this, but the code doesn't compile (AT&T/Intel syntax issue).

  • Is there a way to use that code inside Linux?
  • Does anyone know where I can find an implementation of memcpy that moves 128 bits at a time?
haster8558
  • http://stackoverflow.com/questions/9347909/can-i-use-intel-syntax-of-x86-assembly-with-gcc – Matteo Italia Nov 30 '15 at 16:19
  • MMX is only 64 bits (8 bytes). ITYM SSE, which is 128 bits (16 bytes). – Paul R Nov 30 '15 at 16:35
  • I'm not really knowledgeable on this topic, but can't you use DMA for this? – ElderBug Nov 30 '15 at 16:37
  • @PaulR yes, sorry. My knowledge of assembly and device programming is really poor. – haster8558 Nov 30 '15 at 16:37
  • @ElderBug Yes, I can use it for big chunks of data. But I want to find a tradeoff between DMA and CPU transfer. – haster8558 Nov 30 '15 at 16:38
  • @MatteoItalia I've seen it, but I didn't succeed in compiling it. I've tried adding the option -masm=intel, but it doesn't work. – haster8558 Nov 30 '15 at 16:56
  • @haster8558: "doesn't work" doesn't mean anything. Anyhow, unless you are compiling just your code (and not using any header which has `asm` blocks) you should use `.intel_syntax noprefix` at the start of your asm block and `.att_syntax prefix` at the end. – Matteo Italia Nov 30 '15 at 17:04
  • http://stackoverflow.com/questions/26246040/whats-missing-sub-optimal-in-this-memcpy-implementation/26256216#26256216 – Z boson Nov 30 '15 at 18:58
  • You should investigate whether the Linux kernel has its own fast memcpy function you can use. It may even be tuned to the particular CPU it runs on at boot time. – Ross Ridge Nov 30 '15 at 21:12
  • @RossRidge: great point. Linux's memcpy looks pretty good for CPUs where `rep movs` doesn't suck. Otherwise it's *maybe* worth saving / restoring the vector regs so you can use an SSE loop. – Peter Cordes Dec 01 '15 at 03:07
  • What do you mean by TLP? You mean Transaction Layer Packet? We use too many acronyms. – Z boson Dec 01 '15 at 08:00
  • Yes, sorry. TLP is the packet sent over the PCIe bus. – haster8558 Dec 01 '15 at 09:12

3 Answers


First of all, you probably use GCC as the compiler, and it uses the asm statement for inline assembler. When using that, you have to put the assembler code in a string literal (which is copied into the generated assembly before it is sent to the assembler; this means the string should contain newline characters between instructions).
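A minimal sketch of the form (the instructions are just placeholders):

/* basic asm: the whole block is one string literal, with "\n\t"
   separating the instructions in the generated assembler output */
asm("nop\n\t"
    "nop");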

Second, you will probably have to use AT&T syntax for the assembler.

Third, GCC uses extended asm to pass variables between assembler and C.
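For example, a small sketch of extended asm that copies one 64-bit value through a register (the constraint letters tell GCC how to map C variables to operands):

static inline unsigned long copy64(unsigned long src)
{
    unsigned long dst;
    /* "=r" asks for an output register, "r" for an input register;
       GCC substitutes them for %0 and %1 */
    asm("movq %1, %0" : "=r"(dst) : "r"(src));
    return dst;
}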

Fourth, you should probably avoid inline assembler when possible anyway, as the compiler won't be able to schedule instructions across an asm statement (at least this used to be true). Instead you could make use of GCC extensions like the vector_size attribute:

typedef float v4sf __attribute__((vector_size(16)));  /* a 16-byte vector type */

void fubar( v4sf *p, v4sf* q )
{
  /* four 16-byte loads... */
  v4sf p0 = *p++;
  v4sf p1 = *p++;
  v4sf p2 = *p++;
  v4sf p3 = *p++;

  /* ...followed by four 16-byte stores */
  *q++ = p0;
  *q++ = p1;
  *q++ = p2;
  *q++ = p3;
}

This has the advantage that the compiler will produce code even if you compile for a processor that doesn't have XMM registers, but perhaps some other 128-bit registers (or no vector registers at all).
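For the single 16-byte transfer the question is about, a minimal sketch along these lines (my illustration, untested, and still subject to the kernel register-saving caveat below) would let GCC pick a 16-byte store on targets that have one:

typedef long long v2di __attribute__((vector_size(16)));

/* one 16-byte store through a vector type; volatile so the compiler
   doesn't elide or reorder the MMIO access */
static inline void store16(volatile v2di *dst, v2di val)
{
    *dst = val;
}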

Fifth, you should investigate whether the provided memcpy isn't already fast enough; it is often heavily optimized.

Sixth, you should take precautions if you're using special registers in the Linux kernel; there are registers that aren't saved during a context switch, and the SSE registers are among them.

Seventh, as you're using this to test throughput, you should consider whether the processor is a significant bottleneck in the equation. Compare the actual execution of the code with the reads from/writes to RAM (do you hit or miss the cache?) and with the reads from/writes to the peripheral.

Eighth, when moving data you should avoid moving big chunks from RAM to RAM, and if it's to/from a peripheral with limited bandwidth you should definitely consider using DMA for that. Remember that if it's access time that limits the performance, the CPU will still be considered busy (although it can't run at 100% speed).

skyking
  • You can't use SSE in the Linux kernel without taking extra precautions. Vector regs aren't saved/restored unless they have to be, because normal kernel code doesn't touch them. – Peter Cordes Dec 01 '15 at 03:04
  • @PeterCordes I wasn't really considering that this was in kernel space; I've updated the answer to point that out. – skyking Dec 01 '15 at 06:21
  • Just a few clarifications. The problem isn't the speed of the standard memcpy; the problem is that the standard memcpy doesn't produce transfers of 16 bytes on the PCIe bus. The bottleneck is the CPU; I'm sure about this because I'm using a protocol analyzer. On Windows, on the same hardware, I can see a payload of 16 bytes; on Linux I don't. The only idea I've got is the SSE instruction set. – haster8558 Dec 01 '15 at 07:36
  • @haster8558 Have you ruled out the possibility that the RAM may be a bottleneck? – skyking Dec 01 '15 at 08:05
  • There isn't any evidence that the RAM could be involved in the transfer. Also, the write throughput for PCIe x1 is almost 70 MB/s on Linux, and it almost doubles on Windows with the same hardware. On the same hardware I want the same behaviour. The bottleneck is indeed the CPU. – haster8558 Dec 01 '15 at 08:30

Leaving this answer here for now, even though it's now clear the OP just wants a single 16B transfer. On Linux, his code is causing two 8B transfers over the PCIe bus.

For writing to MMIO space, it's worth trying movnti write-combining-store instructions. The source operand for movnti is a GP register, not a vector reg.

You can probably generate that with intrinsics, if you #include <immintrin.h> in your driver code. That should be fine in the kernel, as long as you're careful about which intrinsics you use; the header doesn't define any globals.
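As a sketch (my illustration; untested in a driver), a pair of adjacent movnti stores can be generated with the _mm_stream_si64 intrinsic (x86-64 only); whether the two weakly-ordered stores actually combine into one 16B TLP is something to verify with the protocol analyzer:

#include <immintrin.h>

/* two movnti stores from GP registers; the store buffer may combine
   them, and the sfence makes them globally visible before any later
   ordinary stores */
static inline void store16_nt(void *dst, long long lo, long long hi)
{
    _mm_stream_si64((long long *)dst, lo);
    _mm_stream_si64((long long *)dst + 1, hi);
    _mm_sfence();
}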


So most of this section isn't very relevant.

On most CPUs (where rep movs is good), Linux's memcpy uses it. It only uses a fallback to an explicit loop for CPUs where rep movsq or rep movsb are not good choices.

When the size is a compile-time-constant, memcpy has an inline implementation using rep movsl (AT&T syntax for rep movsd), then for cleanup: non-rep movsw and movsb if needed. (Actually kinda clunky, IMO, since the size is a compile-time constant. Also doesn't take advantage of fast rep movsb on CPUs that have it.)
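For reference, a bare-bones `rep movsq` copy in GNU inline asm might look like this sketch (my illustration, not the kernel's actual code):

/* sketch: copy nquads * 8 bytes; "+D"/"+S"/"+c" pin the operands to
   rdi/rsi/rcx, which rep movsq uses and modifies. Assumes DF=0, as
   the ABI guarantees. */
static inline void copy_movsq(void *dst, const void *src, unsigned long nquads)
{
    asm volatile("rep movsq"
                 : "+D"(dst), "+S"(src), "+c"(nquads)
                 : : "memory");
}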

Intel CPUs since P6 have had at least fairly good rep movs implementations. See Andy Glew's comments on it.

But still, you're wrong about memcpy only moving in 64bit blocks, unless I'm misreading the code or you're on a platform where it decides to use the fallback loop.

Anyway, I don't think you're missing out on much perf by using the normal Linux memcpy, unless you've actually single-stepped your code and seen it doing something silly.

For large copies, you'll want to set up DMA anyway. CPU usage by your driver is important, not just the max throughput you can obtain on an otherwise-idle system. (Be careful of trusting microbenchmarks too much.)


Using SSE in the kernel means saving/restoring the vector registers. It's worth it for the RAID5/RAID6 code. That code may only run from a dedicated thread, rather than from contexts where the vector/FPU registers still have another process's data.

Linux's memcpy can be used from any context, so it avoids using anything but the usual integer registers. I did find an article about an SSE kernel memcpy patch, where Andi Kleen and Ingo Molnar both say it wouldn't be good to always use SSE for memcpy. Maybe there could be a special bulk-memcpy for big copies where it's worth saving the vector regs.

You can use SSE in the kernel, but you have to wrap it in kernel_fpu_begin() and kernel_fpu_end(). On Linux 3.7 and later, kernel_fpu_end() actually does the work of restoring FPU state, so don't use a lot of fpu_begin/fpu_end pairs in a function. Also note that kernel_fpu_begin disables pre-emption, and you must not "do anything that might fault or sleep".
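As a sketch (assuming a context where nothing in between can fault or sleep), the pattern would look like:

#include <asm/fpu/api.h>   /* on older kernels the declarations are in <asm/i387.h> */

/* hypothetical 16B copy: all SSE register use is bracketed by one
   begin/end pair; keep the region short, since pre-emption is disabled */
static void copy16_sse(void *dst, const void *src)
{
    kernel_fpu_begin();
    asm volatile("movdqu (%0), %%xmm0\n\t"
                 "movdqu %%xmm0, (%1)"
                 : : "r"(src), "r"(dst) : "xmm0", "memory");
    kernel_fpu_end();
}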

In theory, saving just one vector reg, like xmm0, would be good. You'd have to make sure you used SSE, not AVX instructions, because you need to avoid zeroing the upper part of ymm0 / zmm0. You might cause an AVX+SSE stall when you return to code that was using ymm regs. Unless you want to do a full save of the vector regs, you can't run vzeroupper. And even to do that, you'd need to detect AVX support...

However, doing even this one-reg save/restore would require you to take the same precautions as kernel_fpu_begin, and disable pre-emption. Since you'd be storing to your own private save slot (prob. on the stack), rather than to task_struct.thread.fpu, I'm not sure that even disabling pre-emption is enough to guarantee that user-space FPU state won't be corrupted. Maybe it is, but maybe it isn't, and I'm not a kernel hacker. Disabling interrupts to guard against this, too, is probably worse than just using kernel_fpu_begin()/kernel_fpu_end() to trigger a full FPU state save using XSAVE/XRSTOR.

Peter Cordes
  • Thanks, I'll try to explain better. I have an FPGA linked to my CPU through a PCIe bus. Between the CPU and the FPGA there is a PCIe protocol analyzer. I've seen on Windows that when (in the device driver) I call memcpy with a block size bigger than 16 bytes, the payload of the memory write is 16 bytes. On Linux I can't see the same thing. On Windows I can see this behaviour only in device drivers built with WDF. About the speed of memcpy I don't care, because the CPU takes the same time to generate the TLP (with an 8- or 16-byte payload) for the root complex, so there isn't any difference. – haster8558 Dec 01 '15 at 07:54
  • Of course a DMA transfer is the solution for big chunks of data, but I want to be able to transfer 16 bytes in a single shot. The DMA has to be programmed: that means sending at least 2 TLPs to the endpoint and then waiting for an interrupt. – haster8558 Dec 01 '15 at 07:57
  • @haster8558: Right, for transferring just 16B, obviously you don't want to DMA that. Your question left out *way* too much detail for anyone to figure out what kind of thing you were doing. Everyone assumed you were talking about bulk transfers, not a single 128b. I don't write device drivers myself, so I don't even know what a TLP is. I'm not sure what happens when you `rep movsq` to or from MMIO space, rather than between two normal writeback memory regions. But it sounds like you get 64bit chunks :/ – Peter Cordes Dec 01 '15 at 08:01
  • @haster8558: have you tried using a kernel debugger to double-check that your memcpy is happening with `rep movsq`? Or disassemble your driver, if the memcpy is inlined. – Peter Cordes Dec 01 '15 at 08:03
  • It would be interesting to see some experimental evidence. I tested glibc's memcpy and gcc's built-in memcpy and found it was easy to beat them in some cases. I basically redid the tests Agner Fog did in his optimizing C++ manual. I would be surprised if Linux's memcpy is much better. I'm willing to bet that for sizes much larger than the last level cache, using non-temporal stores and multiple threads will beat Linux's memcpy. – Z boson Dec 01 '15 at 08:08
  • @Zboson: thanks for catching that editing error :P I don't think it makes sense to do a multi-threaded memcpy inside the kernel. There are probably very few cases where a really big memcpy happens (like maybe a really big `read(2)` system call?) BTW, `rep movs` has used stores that avoid the RFO-overhead since P6, according to Andy Glew (who designed it). – Peter Cordes Dec 01 '15 at 08:09
  • I think by TLP the OP means Transaction Layer Packet but I don't know for sure. – Z boson Dec 01 '15 at 08:13
  • Using `movnti` (NT store from a GP reg) for giant memcpy is interesting, but inlining a check for it into memcpy would slightly slow down the typical case. Since you have a test framework set up, how does `REP MOVSQ` or `REP MOVSB` compare to an SSE MOVNT loop, for very large buffers? `rep movs` has a huge advantage (for the kernel) that it's fast to decode, doesn't branch-mispredict, and can even inline without bloating the code. I think kernel code tends to run once every so often, rather than in a tight loop like a microbench would. – Peter Cordes Dec 01 '15 at 08:16
  • I am not advocating that Linux's memcpy, or for that matter glibc's, should use multiple threads in general. Agner Fog's memcpy in his asmlib does check the cache size and uses non-temporal stores if appropriate. I don't have my test framework set up now, so I won't be able to test anytime soon. – Z boson Dec 01 '15 at 09:03
  • Yes, TLP is the packet sent in the PCIe protocol. The problem is that the CPU needs time to access the root complex and generate the TLP. This time is fixed, and it's different for a write request and a read request. Since this fixed time is almost constant, the payload size can make the difference. – haster8558 Dec 01 '15 at 09:11
  • @haster8558: Have you tried `movnti` for stores to the MMIO region? The weakly-ordered aspect might let it do write-combining and read-combining with store buffers or load buffers. IIRC, `movnt` loads are only different from regular loads for memory regions that have a memory type other than writeback (e.g. video RAM being uncacheable, or uncacheable-write-combining). https://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers might be interesting. – Peter Cordes Dec 01 '15 at 11:30
  • Hrm, `movnti` is only available as a store. Only `movntdqa` can do "streaming" loads, so there's no point in using a single NT load that's already as wide as you want. It doesn't need to combine with anything. If there was a 64bit NT load you could use with a GP reg, you'd be all set. (Assuming this read-combining idea works at all.) **What instructions does the windows driver run to get 16B TLP packets?** – Peter Cordes Dec 01 '15 at 11:33
  • It is a memcpy. The driver on Windows uses the WDF framework. It is almost always slower than WDM or Linux, but it performs a write of 16 bytes. – haster8558 Dec 01 '15 at 11:39
  • @haster8558: I meant machine code, not high-level source... I'm wondering if Windows has a trick to do it without SSE, or if the Windows driver does just use SSE. I've never written Windows kernel / driver code, and don't know anything about it. Try disassembling the relevant function in the windows driver. – Peter Cordes Dec 01 '15 at 11:41
  • Ah sorry, I actually don't know, because I'm not so deep into coding. I'm working with FPGAs, so I'm a little bit out of my field. I can try to see what instructions it calls, but it'll take time. – haster8558 Dec 01 '15 at 11:49
  • A quick google for "windows driver SSE" found https://msdn.microsoft.com/en-us/library/windows/hardware/ff545910(v=vs.85).aspx, which says Windows drivers can use SSE. It might still be worth checking in case your driver compiled to code that manages to generate a 16B TLP some other way, but it's pretty likely that the provided `memcpy` simply uses SSE, since we now know it can. – Peter Cordes Dec 01 '15 at 12:02
  • Indeed it can, but the checks performed by the framework are slow, and it starts to use SSE only when the block size is bigger than 128 bytes, not when it's bigger than 16. Maybe there is a tradeoff between execution speed and block size, but I actually don't care. – haster8558 Dec 01 '15 at 13:08
  • @haster8558: Well maybe it **is** worth disassembling the memcpy in your driver, to see if it just calls a function or if it inlined a single SSE load/store (since the size is a compile-time-constant 16B, right?) – Peter Cordes Dec 01 '15 at 13:23
  • @haster8558: Normally MMIO regions are set as uncacheable (with MTRR or PAT), but according to [John McCalpin's blog post on MMIO accesses](http://sites.utexas.edu/jdm4372/2013/05/29/notes-on-cached-access-to-memory-mapped-io-regions/), you *can* sometimes get away with cached MMIO. But then you have to `clflush` manually when you want to see fresh data. – Peter Cordes Dec 02 '15 at 08:36
  • Given that the MMIO memory where your device is mapped is probably uncacheable, you probably can't get the CPU to do a prefetch. I was thinking it might be worth trying a `PREFETCHNTA` followed (several cycles later) by normal 64bit loads. The docs say it just hints the normal prefetch system, so it sounds like you can't get it to prefetch from a location where it wouldn't ever prefetch on its own. I was hoping this would trigger a 64B read-buffer fill, which you could then read with other non-SSE insns. – Peter Cordes Dec 02 '15 at 08:39
  • @PeterCordes At the moment I'm following a different route. I'm using an extern function inside my driver, compiled with nasm, but I have some problems linking it properly. Afterwards I'll try to disassemble that code. – haster8558 Dec 02 '15 at 08:41
  • @haster8558: how does an external function help, compared to gcc inline asm? You still can't use SSE registers. You can probably use write-combining stores with `movnti`, but IDK about loads; prob. impossible without clobbering an SSE register. For that, you apparently have to use `kernel_fpu_begin()` / `kernel_fpu_end()`, and avoid anything that can sleep in between. See http://stackoverflow.com/a/16068179/224132 for some stuff about what it does. – Peter Cordes Dec 02 '15 at 09:04
  • Help me understand how everything works. I had several problems compiling the assembly. Now I use nasm to compile my own assembly, just to transfer 128 bits. My idea is to use the xmm0 register. I've read that the first float parameter of a function is stored in the xmm0 register. I define a function 'mmx_copy(float data, void *addr)'. In data there is the data to be transferred, in addr the destination address. In the assembly, mmx_copy performs only movntdq [rdi], xmm0 – haster8558 Dec 02 '15 at 09:21
  • @haster8558: Just use C intrinsics since you don't know what you're doing with asm. See my edit, esp. the part about using `kernel_fpu_begin()` before clobbering any xmm registers. Putting your code that uses xmm registers in a separately-compiled nasm file doesn't have any effect on the need to save them properly. Also, why would you call your function `mmx_copy`, when you're using SSE, not MMX? Anyway, don't even make a function at all; that's not helpful unless you know asm but are trying to avoid gcc inline asm (which by default uses AT&T syntax, not Intel/NASM). – Peter Cordes Dec 02 '15 at 09:28

The link you mentioned is using non-temporal stores. I have discussed this several times before, for example here and here. I would suggest you read those before proceeding further.

But if you really want to reproduce the inline assembly code in the link you mentioned, here is how to do it: use intrinsics instead.

The fact that you cannot compile that code with GCC is exactly one of the reasons intrinsics were created. Inline assembly has to be written differently for 32-bit and 64-bit code and typically has different syntax for each compiler. Intrinsics solve all these issues.

The following code should compile with GCC, Clang, ICC, and MSVC in both 32-bit and 64-bit mode.

#include "xmmintrin.h"
void X_aligned_memcpy_sse2(char* dest, const char* src, const unsigned long size)
{
    for(int i=size/128; i>0; i--) {
        __m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
        _mm_prefetch(src + 128, _MM_HINT_NTA);
        _mm_prefetch(src + 160, _MM_HINT_NTA);
        _mm_prefetch(src + 194, _MM_HINT_NTA);
        _mm_prefetch(src + 224, _MM_HINT_NTA);

        xmm0 = _mm_load_si128((__m128i*)&src[   0]);
        xmm1 = _mm_load_si128((__m128i*)&src[  16]);
        xmm2 = _mm_load_si128((__m128i*)&src[  32]);
        xmm3 = _mm_load_si128((__m128i*)&src[  48]);
        xmm4 = _mm_load_si128((__m128i*)&src[  64]);
        xmm5 = _mm_load_si128((__m128i*)&src[  80]);
        xmm6 = _mm_load_si128((__m128i*)&src[  96]);
        xmm7 = _mm_load_si128((__m128i*)&src[ 112]);

        _mm_stream_si128((__m128i*)&dest[   0], xmm0);
        _mm_stream_si128((__m128i*)&dest[  16], xmm1);
        _mm_stream_si128((__m128i*)&dest[  32], xmm2);
        _mm_stream_si128((__m128i*)&dest[  48], xmm3);
        _mm_stream_si128((__m128i*)&dest[  64], xmm4);
        _mm_stream_si128((__m128i*)&dest[  80], xmm5);
        _mm_stream_si128((__m128i*)&dest[  96], xmm6);
        _mm_stream_si128((__m128i*)&dest[ 112], xmm7);
        src  += 128;
        dest += 128;
    }
}

Note that src and dest need to be 16-byte aligned and that size needs to be a multiple of 128.
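As a hypothetical user-space harness (sizes are illustrative), _mm_malloc satisfies both the alignment and the size requirements:

#include <emmintrin.h>   /* also pulls in _mm_malloc/_mm_free */
#include <string.h>

int main(void)
{
    char *src = _mm_malloc(4096, 16);   /* 16-byte aligned, multiple of 128 */
    char *dst = _mm_malloc(4096, 16);
    memset(src, 0xAB, 4096);
    X_aligned_memcpy_sse2(dst, src, 4096);
    _mm_sfence();   /* drain the non-temporal stores before reading dst */
    _mm_free(dst);
    _mm_free(src);
    return 0;
}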

I don't, however, advise using this code. In the cases where non-temporal stores are useful, loop unrolling is useless and explicit prefetching is rarely ever useful. You can simply do

void copy(char *x, char *y, int n)
{
    /* x and y must be 16-byte aligned and n a multiple of 16;
       compile with -fopenmp for the pragma to take effect */
    #pragma omp parallel for schedule(static)
    for(int i=0; i<n/16; i++) {
        _mm_stream_ps((float*)&y[16*i], _mm_load_ps((float*)&x[16*i]));
    }
}

more details as to why can be found here.


Here is the assembly from the X_aligned_memcpy_sse2 function using intrinsics with GCC -O3 -S -masm=intel. Notice that it's essentially the same as here.

    shr rdx, 7
    test    edx, edx
    mov eax, edx
    jle .L1
.L5:
    sub rsi, -128
    movdqa  xmm6, XMMWORD PTR [rsi-112]
    prefetchnta [rsi]
    prefetchnta [rsi+32]
    prefetchnta [rsi+64]
    movdqa  xmm5, XMMWORD PTR [rsi-96]
    prefetchnta [rsi+96]
    sub rdi, -128
    movdqa  xmm4, XMMWORD PTR [rsi-80]
    movdqa  xmm3, XMMWORD PTR [rsi-64]
    movdqa  xmm2, XMMWORD PTR [rsi-48]
    movdqa  xmm1, XMMWORD PTR [rsi-32]
    movdqa  xmm0, XMMWORD PTR [rsi-16]
    movdqa  xmm7, XMMWORD PTR [rsi-128]
    movntdq XMMWORD PTR [rdi-112], xmm6
    movntdq XMMWORD PTR [rdi-96], xmm5
    movntdq XMMWORD PTR [rdi-80], xmm4
    movntdq XMMWORD PTR [rdi-64], xmm3
    movntdq XMMWORD PTR [rdi-48], xmm2
    movntdq XMMWORD PTR [rdi-128], xmm7
    movntdq XMMWORD PTR [rdi-32], xmm1
    movntdq XMMWORD PTR [rdi-16], xmm0
    sub eax, 1
    jne .L5
.L1:
    rep ret
Z boson
  • Thank you so much. Just a doubt: one of my colleagues told me that intrinsics can't be used inside kernel mode. Did I misunderstand? – haster8558 Dec 01 '15 at 07:44
  • @haster8558, I don't know. Why did your colleagues say that? The only thing I can imagine is that the compiler converts the intrinsics into instructions that are not compatible with the kernel. You have to look at the assembly produced. I guess that's one disadvantage of intrinsics: you can't guarantee you get exactly the same code with different compilers, or even different versions of a compiler. – Z boson Dec 01 '15 at 08:04
  • The problem is that xmmintrin.h includes stdlib.h, and there are several problems during compilation. – haster8558 Dec 01 '15 at 08:33
  • Is it possible to see X_aligned_memcpy_sse2 disassembled in AT&T syntax? – haster8558 Dec 01 '15 at 10:58
  • @haster8558, yes: `gcc -O3 -S foo.c`. Note again that I don't recommend using this code. I just wanted to illustrate how to do this with intrinsics. – Z boson Dec 01 '15 at 11:01
  • My idea is to copy and paste the assembly and then compile it as an extern function. Is it a bad idea? At the moment I'm trying to work in user space. When I see that I get a payload of 16 bytes, I'll try to integrate it inside the device driver. – haster8558 Dec 01 '15 at 11:04