
I have replaced my old home server (i3-6100: 51 W TDP, 3.7 GHz, SSE4.1/SSE4.2/AVX2) with a thin client (Celeron J4105: 10 W TDP, 1.5 GHz base / 2.5 GHz turbo, SSE4.2).

Can Apache make use of CPU AVX instructions?

Kamil

2 Answers


Glibc automatically uses AVX/AVX2 versions of memcpy, memcmp, strlen, and similar functions when available, which is nice for small to medium-length strings that are hot in L1d or L2 cache (e.g. maybe twice as fast for strings of 100 B to 128 KiB). For shorter strings, startup and cleanup overhead are a significant fraction of the total time. Hopefully Apache doesn't spend a lot of time looping over strings.

There might possibly be some auto-vectorized loops inside Apache itself if you compile with -O3 -march=native, but that's unlikely to matter much.

I doubt there's anything in Apache that would be worth manually dispatching based on CPUID (other than the libc functions), so you probably won't find any AVX instructions in the Apache binary on your i3 server if you check with a disassembler, unless it was compiled specifically for that machine or for AVX-capable machines in general. If the whole binary was compiled with AVX enabled, even scalar FP math would use instructions like vmovsd / vucomisd instead of movsd / ucomisd, so if you see any movsd, it wasn't compiled that way.

See How to check if compiled code uses SSE and AVX instructions? and note the distinction between SIMD (packed) and scalar instructions.


One interesting feature of AVX that's relevant for multithreaded programs: Intel recently documented that the AVX feature flag implies 16-byte aligned load/store is guaranteed atomic. (And I think AMD is planning to do so if they haven't already, since it's also true in practice on their CPUs.) Previously the only support for 16-byte lock-free atomics was via lock cmpxchg16b, meaning that pure-load cost as much as an RMW. GCC-compiled code can take advantage of this via libatomic, including via updates to a shared libatomic which dispatches to more efficient load/store functions on CPUs with AVX.

So anyway, cheaper lock-free atomics for objects the size of two pointers in 64-bit mode. Not a game-changer for code that doesn't spend a ton of time communicating between threads. And it doesn't help the kernel because you can't take advantage of it with -mgeneral-regs-only; 16-byte load/store require an XMM reg, unless cmpxchg16b without a lock prefix counts. But that could do a non-atomic RMW if the compare succeeds, so that's unusable.


Probably more relevant is that CPUs with AVX2 also have faster memcpy inside the kernel, e.g. for copy_to_user (from the pagecache) in read system calls: rep movsb can work in 32-byte chunks internally in microcode, vs. 16-byte chunks on CPUs whose load/store data paths are only 16 bytes wide.

(AVX can be implemented on CPUs with 16-byte load/store paths, like Zen 1 and Ivy Bridge, but your i3 with AVX2 has 32-byte datapaths between execution units and L1d cache. https://www.realworldtech.com/haswell-cpu/5/)


AVX2 can help with some OpenSSL stuff, but probably nothing important for web serving.

Usually you'll be using AES for encryption, and both CPUs have AES-NI. (256-bit AES, the VAES extension, does enable working on 32 bytes per instruction instead of 16, but that has to be 2 separate 16-byte blocks in parallel, not one single AES stream going twice as fast; and it's new in Ice Lake / Zen 3, so neither of these CPUs has it anyway.)
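
If you want numbers for your own two machines, OpenSSL's built-in benchmark is an easy comparison; it automatically picks the best implementation (AES-NI, SHA-NI, AVX2, ...) for the running CPU. A sketch (flag spellings as in OpenSSL 1.1.1/3.x; the rates will obviously differ per machine):

```shell
# Throughput of AES-GCM as used for TLS. -bytes/-seconds keep the run short;
# the summary table at the end shows bytes processed per second.
openssl speed -evp aes-128-gcm -bytes 16384 -seconds 1 2>/dev/null | tail -n 3
```

Running the same command on both boxes tells you more than reasoning about ISA extensions does.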

There's also a possible speedup for MD5 or SHA512 using AVX2, if I recall correctly.

For SHA1 and SHA256, the new CPU has SHA-NI (new in Goldmont and, on the big-core side, Ice Lake; the J4105 is Goldmont Plus, while the old Skylake i3 lacks SHA-NI and has to do those hashes manually with SIMD). There is no VEX encoding of SHA1RNDS4 xmm or of the SHA256 acceleration instructions, let alone one that uses 256-bit vectors to go faster. If you use SHA512 for anything, that will go somewhat faster with AVX2 than with SSE4.2, all else being equal.
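
On Linux you can check which of these extensions each box actually reports straight from /proc/cpuinfo; the flag names below are the kernel's spellings:

```shell
# Print yes/no for each relevant CPUID feature flag the kernel exposes.
for f in aes avx avx2 sha_ni; do
    if grep -qw "$f" /proc/cpuinfo; then echo "$f: yes"; else echo "$f: no"; fi
done
```

On the J4105 you'd expect aes and sha_ni but not avx/avx2; on the i3-6100, aes and avx/avx2 but not sha_ni.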

(And of course a Skylake would run the same asm faster clock-for-clock, with a wider front-end that's more robust against bottlenecks, and more throughput in the back-end. https://agner.org/optimize/ and https://uops.info/ - compare your old Skylake against your "new" Goldmont+. I put "new" in quotes because it launched at the end of 2017, only a couple years after your Skylake.)

Intel hadn't shipped AVX support in their low-power cores until Gracemont, the E-cores in Alder Lake. IDK if/when they're planning a stand-alone low-power chip with only Gracemont cores to replace Tremont, or whether they'd include AVX in it.

Peter Cordes

Out of curiosity, and since I have a Gentoo Linux system where I can simply compile Apache with -O3 -march=native, I tried looking at the disassembly to see whether AVX vector instructions are generated at all.

objdump -d --no-show-raw-insn --no-addresses \
      /usr/sbin/apache2 /usr/lib64/apache2/modules/*.so | 
    grep -oE '^\s+([[:alpha:]][[:alnum:]]*)+' |
    LC_ALL=C sort | uniq -c

This gives the following stats:

      3         vaddsd
      1         vcomisd
     23         vcomiss
      3         vcvtsd2ss
      9         vcvtsi2sd
      1         vcvtsi2sdq
     25         vcvtsi2ss
      2         vcvtsi2ssl
     11         vcvtsi2ssq
     51         vcvtss2sd
      5         vcvttsd2si
      2         vcvttss2si
      2         vcvttss2usi
      4         vcvtusi2sd
      1         vcvtusi2sdl
      1         vcvtusi2ss
      4         vcvtusi2ssl
      3         vcvtusi2ssq
      8         vdivsd
     28         vdivss
     19         vextracti128
      2         vextracti64x2
     15         vinserti128
    185         vmovaps
     74         vmovd
    585         vmovdqa
     28         vmovdqa64
   1510         vmovdqu
     55         vmovdqu8
    323         vmovq
     15         vmovsd
    113         vmovss
      8         vmulsd
     30         vmulss
     22         vpackuswb
     27         vpaddd
     16         vpaddq
      3         vpalignr
     29         vpand
     17         vpblendmq
      2         vpblendvb
      1         vpbroadcastd
     14         vpbroadcastq
      2         vpbroadcastw
      8         vpcmpeqb
      3         vpcmpeqd
     16         vpcmpeqq
     16         vpcmpneqq
      1         vpermi2w
     20         vpermq
      1         vpermt2d
      1         vpermt2q
      7         vpermt2w
      1         vpextrb
      5         vpextrq
     32         vpgatherdd
      8         vpinsrb
     44         vpinsrd
    249         vpinsrq
      3         vpmaxsd
      3         vpmaxsq
      3         vpminsd
      1         vpmovqd
      8         vpmovsxdq
     18         vpmovzxbw
     36         vpmovzxwd
      2         vpmuldq
     17         vpor
     28         vpshufb
      8         vpshufd
     24         vpslld
      8         vpsrld
     13         vpsrldq
      1         vpsrlq
     20         vpsrlw
      4         vpsubq
      1         vpternlogd
      1         vpunpcklbw
      2         vpunpckldq
      4         vpunpcklqdq
      4         vpunpcklwd
    317         vpxor
      1         vshufpd
      1         vshufps
     12         vucomiss
     12         vxorpd
     41         vxorps
    126         vzeroupper

So there is definitely some use. However, this doesn't prove that these instructions are actually executed, or that they improve performance compared to SSE2 code or code compiled without auto-vectorization.

I find it somewhat curious to see instructions such as vpgatherdd in use; that's not something I would expect a compiler to emit on its own. I should also note that this is GCC 11.3.1 on an i7-11800H (Tiger Lake), so this includes AVX-512, not just AVX1 or AVX2.

As noted by Peter, the more likely candidate for effective usage is in the libc. I might add that OpenSSL will also make use of AVX if available.

Homer512
  • Interesting, maybe a header library is inlining some code that dispatches based on what CPU it's running on? Or maybe CRT startup code or some static libraries like libatomic or libgcc use `kortestd` for something? `vpblendmq` is also AVX-512, as is `vextracti64x2`, and `vpermt2d` and similar. Also FP to/from unsigned conversions like `vcvtusi2sd`, and `vpcmpneqq` (pre-AVX512, only signed-`gt` and `eq` predicates were available for SIMD-integer compares). What CPU do you have, and what compiler (version) did you use? – Peter Cordes Jan 24 '23 at 19:19
  • GCC and/or clang do rarely invent `vpgatherdd` when auto-vectorizing, even something that doesn't look like a gather. Some of those times, it would be better if it hadn't! I know I remember seeing bad use of `vpgather` in compiler-generated code that wasn't from intrinsics. Maybe for a strided load or something when it did that instead of wide loads and shuffling? x86 doesn't have hardware race detection; it's fine to invent *loads* of data other threads might be writing, especially in the same cache line as data you do read. (valgrind race-detection might complain, though.) – Peter Cordes Jan 24 '23 at 19:23
  • Does AVX help much for OpenSSL? Usually you'll be using AES, and the OP's new CPU does still have AES-NI. I guess maybe for MD5 or SHA512? But for SHA1 and SHA256, the new CPU has SHA-NI (new in Goldmont and Ice Lake. The J4105 is Goldmont+ https://en.wikichip.org/wiki/intel/microarchitectures/goldmont_plus, but the old CPU is Skylake so it didn't have SHA-NI and had to do it manually with SIMD.) I don't think there is a VEX encoding of `SHA1RNDS4 xmm`, let alone one which uses 256-bit vectors to go faster. – Peter Cordes Jan 24 '23 at 19:34
  • I updated my own answer with the OpenSSL comment. (Plus the fact that AVX+AES can give speedups if the use-case lines up with doing 2 blocks in parallel, possibly from different streams). – Peter Cordes Jan 24 '23 at 19:53
  • @PeterCordes regarding my platform: gcc-11.3.1 on an Intel i7-11800H. Also, I'm an idiot. That CPU has AVX512 so I'm updating my answer – Homer512 Jan 24 '23 at 20:31
  • Ah yes, 11800H is Tiger Lake, the generation before Intel decided to lock AVX-512 away for market segmentation / profit margins reasons among other things on client chips like Alder Lake even with the E-cores disabled. Your CPU actually having AVX-512 was definitely high on my list of guesses at the reason for that many different AVX-512 instructions in the binary itself :P – Peter Cordes Jan 24 '23 at 21:06