
While analyzing the profiling results of my application (I/O-heavy), I found copy_user_enhanced_fast_string to be one of the hottest regions. It is called when copying between user and kernel space. The implementation on x86 looks like this:

ENTRY(copy_user_enhanced_fast_string)
    ASM_STAC
    cmpl $64,%edx
    jb .L_copy_short_string /* less then 64 bytes, avoid the costly 'rep' */
    movl %edx,%ecx
1:  rep
    movsb
    xorl %eax,%eax
    ASM_CLAC
    ret

    .section .fixup,"ax"
12: movl %ecx,%edx      /* ecx is zerorest also */
    jmp .Lcopy_user_handle_tail
    .previous

    _ASM_EXTABLE_UA(1b, 12b)
ENDPROC(copy_user_enhanced_fast_string)

Why wasn't vmovaps/vmovups used for this? Hasn't it been shown that AVX has no performance advantage for copying where it is available?
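
For concreteness, a user-space copy loop of the kind this question has in mind might look like the sketch below (avx_copy is a made-up helper; tail handling is omitted and len is assumed to be a multiple of 32). The compiler emits vmovups for these unaligned load/store intrinsics.

#include <immintrin.h>
#include <stddef.h>

/* Illustrative 32-bytes-at-a-time copy; compiles to vmovups loads/stores. */
static void avx_copy(void *dst, const void *src, size_t len)
{
    size_t i;

    for (i = 0; i < len; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)((const char *)src + i));
        _mm256_storeu_si256((__m256i *)((char *)dst + i), v);
    }
}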

St.Antario
  • "Some CPUs are adding enhanced REP MOVSB/STOSB instructions. It's recommended to use enhanced REP MOVSB/STOSB if it's enabled." – Ross Ridge Dec 30 '19 at 05:00
  • Perhaps there is a way you can use mmap'ed IO and avoid the need to copy entirely. – TrentP Dec 30 '19 at 05:19
  • @TrentP Probably it is a good idea to `mmap` the file and then `memcpy` it (using AVX) to the right place. Is that what you meant? – St.Antario Jan 01 '20 at 17:39
  • Like @Peter Cordes said, the best way to use `mmap` would be to NOT copy the data, but to instead use it directly from the mapped location. mmap+copy won't be much better than read if at all. However, your question doesn't state you're accessing a file and there are plenty of other ways to get data from the kernel. Interfaces that have high throughput, e.g. video capture with V4L, usually have a mmap'ed mode that avoids copies. See also `sendfile()` and `splice()` for ways to avoid copies when using pipes and sockets. – TrentP Jan 11 '20 at 20:08
  • @TrentP I measured using `mmap` along with non-temporal stores, and it turned out that minor page faults destroyed most of the performance benefit compared to `read`. – St.Antario Jan 11 '20 at 20:10
  • With mmap+copy, you have the added cost of page-table manipulation to map the file into the process's address space, but you save the system-call mode-switch overhead per read(). And maybe the copy itself is faster. It could be faster or slower overall. The real gain comes if you can avoid copying the data at all. – TrentP Jan 11 '20 at 20:26

1 Answer


Kernel code can only safely use the FPU / SIMD registers between kernel_fpu_begin() / kernel_fpu_end(), which trigger an xsave (and an xrstor before returning to user-space). Or xsaveopt or whatever.
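
As a rough sketch (not actual kernel code), any SIMD use inside the kernel has to be bracketed like this; copy_with_avx is a hypothetical helper shown only to illustrate where the state save/restore cost is paid:

#include <asm/fpu/api.h>    /* kernel_fpu_begin(), kernel_fpu_end() */
#include <linux/types.h>    /* size_t */

/* Hypothetical: the vector body is elided; only the required bracketing is shown. */
static void copy_with_avx(void *dst, const void *src, size_t len)
{
    kernel_fpu_begin();     /* the user task's FPU/SIMD state gets saved so we may clobber it */
    /* ... AVX copy loop would go here ... */
    kernel_fpu_end();       /* state will be restored before returning to user-space */
}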

That's a lot of overhead, and isn't worth it outside of a few rare cases (like md RAID5 / RAID6 parity creation / use.)

Unfortunately this means only GP-integer registers are available for most kernel code. The difference between an AVX memcpy loop and a rep movsb memcpy is not worth an xsave/xrstor on every system call.


Background on context switches vs. just entering the kernel:

In user-space, the kernel handles state save/restore on context switches between user-space tasks. In the kernel, you want to avoid a heavy FPU save/restore every time you enter the kernel (e.g. for a system call) when you're about to return to the same user-space, so you just save the GP-integer regs.


For known-size copies, not having SSE/AVX is not too bad, especially on CPUs with the ERMSB feature (which is when this copy function is used, hence the enhanced_fast_string in the name). For medium to large aligned copies, rep movsb is nearly as fast as a vector copy loop, at least on Intel CPUs, and hopefully on AMD as well. See Enhanced REP MOVSB for memcpy. Without ERMSB, copies still get at least rep movsq + cleanup of the tail.
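
For comparison, the whole ERMSB fast path boils down to a single rep movsb touching only GP-integer registers (RDI/RSI/RCX). A user-space equivalent, just to show how little register state is involved (rep_movsb_copy is my own illustrative name):

#include <stddef.h>

/* rep movsb copies RCX bytes from [RSI] to [RDI]; no FPU/SIMD state is touched. */
static void rep_movsb_copy(void *dst, const void *src, size_t len)
{
    asm volatile("rep movsb"
                 : "+D" (dst), "+S" (src), "+c" (len)
                 :
                 : "memory");
}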

In a 64-bit kernel, GP integer regs are half the size of XMM regs. For small copies (below the kernel's 64-byte threshold), 8x GP-integer 8-byte load and 8-byte store should be pretty efficient compared to the overhead of a system call in general. 4x XMM load/store would be nice, but it's a tradeoff against saving FPU state.
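
A rough GP-register-only short-copy loop in the spirit of the kernel's .L_copy_short_string path might look like the following sketch (illustrative only; the real thing is hand-written asm with fault fixups):

#include <stddef.h>
#include <string.h>

/* Copy 8 bytes per iteration with 64-bit GP loads/stores, then a byte tail. */
static void short_copy_gp(unsigned char *dst, const unsigned char *src, size_t len)
{
    while (len >= 8) {
        memcpy(dst, src, 8);    /* compiles to one 8-byte load + one 8-byte store */
        dst += 8;
        src += 8;
        len -= 8;
    }
    while (len--)
        *dst++ = *src++;
}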

Not having SIMD is significantly worse for strlen/strcpy where pcmpeqb is very good vs. a 4 or 8-byte at a time bithack. And SSE2 is baseline for x86-64, so an x86-64 kernel could depend on that without dynamic dispatch if not for the problem of saving FPU state.
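
The "bithack" being referred to is the classic zero-byte-in-a-word trick. A sketch of an 8-byte-at-a-time strlen built on it (real implementations such as glibc's are more careful; strlen_swar is just an illustrative name):

#include <stddef.h>
#include <stdint.h>

static size_t strlen_swar(const char *s)
{
    const uint64_t ones = 0x0101010101010101ULL;
    const uint64_t high = 0x8080808080808080ULL;
    const char *p = s;

    /* Advance byte-by-byte until p is 8-byte aligned, so the word loads
     * below never cross a page boundary. */
    while ((uintptr_t)p % 8) {
        if (*p == '\0')
            return p - s;
        p++;
    }
    for (;;) {
        uint64_t v = *(const uint64_t *)p;  /* one aligned 8-byte load */
        if ((v - ones) & ~v & high)         /* true iff some byte of v is zero */
            break;
        p += 8;
    }
    while (*p)                              /* locate the exact terminator */
        p++;
    return p - s;
}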

You could in theory eat the SSE/AVX transition penalty and do what some bad Windows drivers do: manually save/restore just the low 128 bits of a vector reg with legacy SSE instructions. (This is why legacy SSE instructions don't zero the upper bytes of the full YMM / ZMM registers.) IDK if anyone has benchmarked doing that for a kernel-mode strcpy, strlen, or memcpy.
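
A sketch of that manual save/restore trick, with the caveat that this is my illustration rather than anything the kernel does: spill one XMM register by hand, use it with legacy SSE (which leaves the upper YMM/ZMM bits alone), then reload it. copy16 is a made-up helper.

static void copy16(void *dst, const void *src)
{
    struct { unsigned char b[16]; } saved;

    asm volatile("movups %%xmm0, %[tmp]\n\t"   /* hand-save the caller's low 128 bits */
                 "movups %[in], %%xmm0\n\t"    /* legacy SSE: upper YMM/ZMM bits untouched */
                 "movups %%xmm0, %[out]\n\t"
                 "movups %[tmp], %%xmm0"       /* hand-restore xmm0 */
                 : [tmp] "+m" (saved),
                   [out] "=m" (*(unsigned char (*)[16])dst)
                 : [in] "m" (*(const unsigned char (*)[16])src));
}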

Peter Cordes
  • Can you explain why `xsave`/`xrstor` is required in kernel space, but not in userspace? In userspace the only thing I can recall is the AVX-SSE transition, which can be solved with `vzeroupper` to avoid saving the upper part of the AVX registers. – St.Antario Dec 30 '19 at 06:21
  • @St.Antario: in user-space, the kernel handles state save/restore on context switches between user-space tasks. In the kernel, you want to avoid a heavy FPU save/restore every time you enter the kernel (e.g. for a system call) when you're about to return to the same user-space, so you just save the GP-integer regs. It's totally unrelated to SSE/AVX transitions. – Peter Cordes Dec 30 '19 at 06:25
  • Since in my case `copy_user_enhanced_fast_string` came from `read`ing regular files, it seems possible to work around it by `mmap`ing the relevant file pages and then `memcpy`ing them (using `AVX2` on my laptop) to some userspace-allocated buffer. Is something like this a common thing to do? – St.Antario Jan 01 '20 at 17:30
  • @St.Antario: yes, `mmap` can be somewhat faster than `read`. But do you need to memcpy? That might eat up much of the gains. Ideally you can use the mmapped region directly, or at least do *some* useful work during the copy. (e.g. count something, byte-swap if necessary, transpose, or whatever the first processing step normally is, even if it's normally a read-only step.) – Peter Cordes Jan 02 '20 at 03:09
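
Picking up the suggestion from the comments (use the mapping directly instead of read() + copy), here is a user-space sketch of processing a file in place; sum_file_bytes is a made-up example consumer, not anything prescribed by the discussion above:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the file read-only and consume the bytes where they are mapped,
 * avoiding the kernel-to-user copy that read() would do. Returns -1 on error. */
long long sum_file_bytes(const char *path)
{
    struct stat st;
    unsigned char *p;
    long long sum = 0;
    off_t i;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                          /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;
    for (i = 0; i < st.st_size; i++)    /* process the data in place */
        sum += p[i];
    munmap(p, st.st_size);
    return sum;
}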