Dynamically determining where a rogue AVX-512 instruction is executing

Question

I have a process running on an Intel machine that supports AVX-512, but this process doesn't directly use any AVX-512 instructions (asm or intrinsics) and is compiled with -mno-avx512f so that the compiler doesn't insert any AVX-512 instructions.

Yet, it is running indefinitely at the reduced AVX turbo frequency. No doubt there is an AVX-512 instruction sneaking in somewhere, via a library, (very unlikely) system call or something like that.

Rather than try to "binary search" down where the AVX-512 instruction is coming from, is there some way I can find it immediately, e.g., trapping on such an instruction?

OS is Ubuntu 16.04.

`-mno-avx512f` should automatically disable avx512cd/pf/er/etc, right? Have you tried grepping through `objdump -d` on your executable and its library dependencies? — that other guy, Aug 24 '18 at 17:14
You could maybe have the kernel clear the control-register bit that enables AVX512, and promises that full ZMM state will be saved/restored on context switches. But are you *sure* that sustained 256-bit FMAs or whatever aren't bringing it down to the same frequency as an occasional 512-bit instruction? I guess you've ruled out code in another process happening to slow down the core you're running on? — Peter Cordes, Aug 24 '18 at 17:15
Ubuntu 16.04 is old enough that I wouldn't expect ZMM usage in glibc memset/memcpy/strchr functions. They do perform runtime CPU detection, though. — Peter Cordes, Aug 24 '18 at 17:16
[How to check if a binary requires SSE4 or AVX on Linux](https://superuser.com/q/726395/173513). One of the answers includes a bash script. You may need to run the script on dependent libraries. `ldd ` should return a list of library names. The names should be OK but the paths may be off depending on your environment. — jww, Aug 24 '18 at 17:47
Out of morbid curiosity, how can you have a binary that supports AVX-512 but not use instructions from the ISA? — jww, Aug 24 '18 at 17:48
@thatotherguy - `-mno-avx512f` only disables AVX-512 in the code I'm compiling and that seems to be working (no AVX-512 in the generated code). Libraries, statically or dynamically linked, however, might have AVX-512. The problem with grepping is that it only gives a static view of what's in there, not where/why the path is actually executed. For example, it might be normal to have a `memcpy` somewhere that uses AVX-512, but not to expect your program to actually call it. — BeeOnRope, Aug 24 '18 at 19:04
@PeterCordes - this CPU does not have HT so there should be no other process running in parallel and I also don't expect to have other processes scheduled on this CPU as the machine is idle. Other processes "work as expected" (i.e., run at full scalar frequency). — BeeOnRope, Aug 24 '18 at 19:05
Some CPUs have all cores locked to the same frequency. Maybe not any SKX though? But if other processes are reliably going to full clock speed we can rule out interference from another process. — Peter Cordes, Aug 24 '18 at 19:09
Everything is working as expected on this CPU but a particular process seems to run at the AVX-512 frequency even though it shouldn't have AVX-512 instructions. I didn't check whether all cores are locked the same frequency, but it seems unlikely for SKX (it's a W-2401). — BeeOnRope, Aug 24 '18 at 19:11
This is quite remarkable because the AVX-512 frequency is only active with heavy AVX-512 code which contains FP and/or int-mul instructions, see [here](https://www.servethehome.com/wp-content/uploads/2017/07/Intel-Skylake-SP-Microarchitecture-AVX2-AVX-512-Clocks.jpg). I wouldn't expect these instructions in a `memcpy` function, for example. Light AVX-512 code should run at AVX2 frequencies. — wim, Aug 24 '18 at 21:23
@PeterCordes With respect to Ubuntu 16.04: A while ago I compiled a piece of non AVX-512 code with `-static` on Ubuntu 16.04. Indeed the `objdump` showed zmm registers and AVX-512 code (`vmov`-s), although 16.04 is quite old. — wim, Aug 24 '18 at 21:34
But note that the turbo frequency behavior might differ a bit between Skylake-SP, Skylake-X and Skylake-W. The link in my previous comment was related to Skylake-SP. I don't know if it applies here. — wim, Aug 24 '18 at 21:41
@wim - I misspoke above: this process is running at the middle speed tier, aka "AVX2 turbo" - but I find that poorly named because it includes actually a few heavy AVX/AVX2 instructions and the vast majority of AVX-512 instructions. — BeeOnRope, Aug 25 '18 at 15:41
@jww - thanks for the link, but those are about static analysis. I'm actually asking for a "runtime" approach, i.e., determining when an AVX-512 instruction is actually executed at runtime. Static analysis gives both false positives and false negatives in that case: a binary may contain AVX-512 instructions that are never actually executed in any given invocation, and static analysis can miss AVX-512 instructions that come from dynamically loaded libraries, runtime generated code or other things like runtime-decompressed code. — BeeOnRope, Aug 26 '18 at 02:18
Btw, the AVX(512) downclocks can be triggered from speculation. So you don't even need to execute an AVX instruction. So code that tries to be smart about running heavy AVX to avoid the clock-downs can be defeated by bad speculation. Needless to say, this is one of the Spectre exploits. — Mysticial, Aug 26 '18 at 10:23
This might be a good read: https://www.realworldtech.com/forum/?threadid=179700&curpostid=179700 — Mysticial, Aug 26 '18 at 10:27
@Mysticial It is! I created this question in an effort to have an easy out-of-the-box way to find AVX-512 instructions that might be "dirtying the uppers" so to speak. — BeeOnRope, Aug 26 '18 at 19:57
@BeeOnRope Thinking back, I've never encountered this. Both MSVC and ICC will unconditionally insert `vzeroupper`s into every function that has any AVX. Also, much of the code will be running all the way down at the AVX512 speed anyway. — Mysticial, Aug 27 '18 at 20:40
My suggestion is to use `perf record` to count the following three events: `CORE_POWER.LVL0_TURBO_LICENSE`, `CORE_POWER.LVL1_TURBO_LICENSE`, and `CORE_POWER.LVL2_TURBO_LICENSE`. Then `perf report` will break it down per ELF image. Doing something like that would enable you to pin down the ELF image. Then that can be followed by static binary analysis. Although I have not used these perf events before. — Hadi Brais, Aug 27 '18 at 23:56
@HadiBrais - I will try, but it doesn't seem that promising. This only tells you the places you happen to be running in the various licenses, not the instruction that kicked it off, unless perhaps you can "edge" trigger it. — BeeOnRope, Aug 28 '18 at 00:09
@BeeOnRope Yeah, but I hope the absolute counts would be useful. I'm assuming also that the number of samples may correlate with counter increments. The other suggestion I have may require a little effort, which is to use dynamic binary instrumentation on your process. This will tell you everything about the process. — Hadi Brais, Aug 28 '18 at 00:13
Maybe search your libraries and set the AVX-512 instructions to be breakpoints or tracepoints. Then run the program with a debugger and see which ones you hit. — Bobby Durrett, Aug 29 '18 at 23:46
Note that the similar issue of dirty upper bits of ymm registers, which are causing bad SSE performance on Skylake, reported here: [Why is this SSE code 6 times slower without VZEROUPPER on Skylake?](https://stackoverflow.com/q/41303780/2439725), existed in Ubuntu 16.04. In Ubuntu 18.04.1 this problem seems to be solved. At least I cannot reproduce it anymore since upgrading to 18.04.1. — wim, Aug 30 '18 at 13:35
@wim - yes, I ultimately tracked it down to the same issue. It's fixed in glibc 2.23 upstream, which is the version that Ubuntu uses, but Ubuntu (Debian, probably) apparently hasn't pulled in the fixes yet. — BeeOnRope, Aug 31 '18 at 21:45
Can you get GDB to produce a dynamic trace of instructions executed while single-stepping? Then search that for `zmm[0-3]`. — Peter Cordes, Sep 01 '18 at 17:20
Not sure if this is related https://stackoverflow.com/q/43256496/2542702 — Z boson, Sep 03 '18 at 06:56

Jérôme Pouiller · Answer 1 · 2018-09-17T14:34:14.950

As suggested in comments, you may search all ELF files of your system and disassemble them in order to check if they use AVX-512 instructions:

$ objdump -d /lib64/ld-linux-x86-64.so.2 | grep %zmm0
14922:       62 f1 fd 48 7f 44 24    vmovdqa64 %zmm0,0xc0(%rsp)
14a2d:       62 f1 fd 48 6f 44 24    vmovdqa64 0xc0(%rsp),%zmm0
14c2c:       62 f1 fd 48 7f 81 50    vmovdqa64 %zmm0,0x50(%rcx)
14ca0:       62 f1 fd 48 6f 84 24    vmovdqa64 0x50(%rsp),%zmm0

(BTW, libc and ld.so do include AVX-512 instructions; are those not the ones you are looking for?)

However, this static search can flag binaries that you never even execute, and it misses code that is decompressed or generated at run time.
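
To avoid disassembling every ELF file on the system, the search can be limited to one executable and the libraries that ldd reports for it. The following is only a sketch: the function names (`parse_ldd`, `count_zmm`, `scan`) are made up for illustration, it assumes `ldd` and `objdump` are on the PATH, and it only counts lines that mention a ZMM register, so it shares the limitations described above.

```python
import re
import subprocess
import sys

# AT&T-syntax objdump output names 512-bit registers as %zmm0..%zmm31.
ZMM = re.compile(r"%zmm\d+")

def parse_ldd(output):
    """Return the absolute library paths listed in `ldd` output."""
    paths = []
    for line in output.splitlines():
        fields = line.split()
        if "=>" in fields:
            idx = fields.index("=>")
            candidate = fields[idx + 1] if idx + 1 < len(fields) else ""
        else:
            candidate = fields[0] if fields else ""
        # Skips the vDSO (no path) and "not found" entries.
        if candidate.startswith("/"):
            paths.append(candidate)
    return paths

def count_zmm(disassembly):
    """Count disassembly lines that mention a ZMM register."""
    return sum(1 for line in disassembly.splitlines() if ZMM.search(line))

def scan(binary):
    """Print the number of ZMM-using lines in `binary` and each dependency."""
    ldd = subprocess.run(["ldd", binary], capture_output=True, text=True)
    for path in [binary] + parse_ldd(ldd.stdout):
        dis = subprocess.run(["objdump", "-d", path],
                             capture_output=True, text=True)
        print(f"{count_zmm(dis.stdout):8d}  {path}")

# Only runs when exactly one path is given on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    scan(sys.argv[1])
```

Saved under a hypothetical name such as zmm_scan.py, it would be run as `python3 zmm_scan.py YOUR_APP`. A non-zero count only means the image contains such instructions, not that they are ever executed.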

If you suspect a particular process (for example, because perf report shows CORE_POWER.LVL*_TURBO_LICENSE events for it), I suggest generating a core dump of this process and disassembling it (note that the first line makes the dump include code pages as well):

$ echo 0xFF > /proc/<PID>/coredump_filter 
$ gdb --pid=<PID>
[...]
(gdb) gcore
Saved corefile core.19602
(gdb) quit
Detaching from program: ..., process ...
$ objdump -d core.19602 | grep %zmm0
7f73db8187cb:       62 f1 7c 48 10 06       vmovups (%rsi),%zmm0
7f73db818802:       62 f1 7c 48 11 07       vmovups %zmm0,(%rdi)
7f73db81883f:       62 f1 7c 48 10 06       vmovups (%rsi),%zmm0
[...]
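
The addresses in the core dump are run-time virtual addresses, so they do not directly say which library the instruction belongs to. One way to find out is to look the address up in /proc/<PID>/maps while the process is still alive. The sketch below assumes the usual maps line layout (address range, permissions, offset, device, inode, optional path); the function names and the sample mapping in the demo are hypothetical, not taken from a real process.

```python
def parse_maps(text):
    """Parse /proc/<PID>/maps text into (start, end, file_offset, path) tuples."""
    regions = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start, end = (int(part, 16) for part in fields[0].split("-"))
        file_offset = int(fields[2], 16)
        # Anonymous mappings have no sixth field.
        path = fields[5].strip() if len(fields) == 6 else "[anonymous]"
        regions.append((start, end, file_offset, path))
    return regions

def locate(regions, address):
    """Return (path, offset into the mapped file) for `address`, or None."""
    for start, end, file_offset, path in regions:
        if start <= address < end:
            return path, address - start + file_offset
    return None

if __name__ == "__main__":
    # Hypothetical mapping, for illustration only.
    demo = "7f73db7a0000-7f73db960000 r-xp 00000000 08:01 1234   /lib/libc-2.23.so\n"
    print(locate(parse_maps(demo), 0x7f73db8187cb))
```

In real use the text would come from `open("/proc/<PID>/maps").read()`, and the returned path tells you which image to inspect with objdump.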

Next, you can easily write a small Python script to add a breakpoint (or a tracepoint) on every AVX-512 instruction. Something like:

(gdb) python
>import os
># objdump prints bare hex addresses, so strip whitespace and add the 0x prefix
>with os.popen('objdump -d core.19602 | grep %zmm0 | cut -f1 -d:') as pipe:
>    for line in pipe:
>        gdb.Breakpoint("*0x" + line.strip())
>end

Granted, this will create hundreds (or thousands) of breakpoints. However, the overhead of a breakpoint is small enough for gdb to handle that (I think <1 kB for each breakpoint).
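
Stopping at every one of these breakpoints by hand gets tedious, so a variant is to generate a gdb command file in which each breakpoint prints a short backtrace and continues. This is a sketch under assumptions: the helper name `gdb_commands` and the output file name are invented here, and it expects objdump-style lines such as the ones shown above.

```python
import re

# An objdump disassembly line starts with a bare hex address and a colon.
ADDRESS = re.compile(r"^\s*([0-9a-fA-F]+):")

def gdb_commands(disassembly_lines, frames=3):
    """Build a gdb command file that logs a backtrace at each ZMM instruction."""
    seen = set()
    out = []
    for line in disassembly_lines:
        match = ADDRESS.match(line)
        if not match or "%zmm" not in line:
            continue
        address = int(match.group(1), 16)
        if address in seen:
            continue
        seen.add(address)
        out += [
            f"break *{address:#x}",
            "commands",
            "  silent",          # suppress the usual stop banner
            f"  bt {frames}",    # show who called the instruction
            "  continue",        # resume without user interaction
            "end",
        ]
    return "\n".join(out) + "\n"

if __name__ == "__main__":
    sample = ["7f73db8187cb:       62 f1 7c 48 10 06       vmovups (%rsi),%zmm0"]
    print(gdb_commands(sample), end="")
```

The generated text can be written to a file (say zmm.gdb) and loaded with `gdb --pid=<PID> -x zmm.gdb`; after typing `continue`, every executed ZMM instruction shows up with its callers.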

Another way would be to run your code in a virtual machine. In particular, I suggest libvex. libvex is used to dynamically instrument code (memory leak detection, memory profiling, etc.). It interprets machine code, translates it to an intermediate representation, and re-encodes machine code for CPU execution. The most famous project using libvex is valgrind (to be fair, libvex is the back-end of valgrind).

Therefore, you can run your application with libvex without any instrumentation with:

$ valgrind --tool=none YOUR_APP

Now you would have to write a tool around libvex in order to detect AVX-512 usage. However, libvex does NOT (yet) support AVX-512. So, as soon as it has to execute an AVX-512 instruction, it will fail with an illegal instruction error:

$ valgrind --tool=none YOUR_APP
[...]   
vex amd64->IR: unhandled instruction bytes: 0x62 0xF1 0xFD 0x48 0x28 0x84 0x24 0x8 0x1 0x0
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==20061== valgrind: Unrecognised instruction at address 0x10913e.
==20061==    at 0x10913E: main (in ...)
==20061== Your program just tried to execute an instruction that Valgrind
==20061== did not recognise.  There are two possible reasons for this.
==20061== 1. Your program has a bug and erroneously jumped to a non-code
==20061==    location.  If you are running Memcheck and you just saw a
==20061==    warning about a bad jump, it's probably your program's fault.
==20061== 2. The instruction is legitimate but Valgrind doesn't handle it,
==20061==    i.e. it's Valgrind's fault.  If you think this is the case or
==20061==    you are not sure, please let us know and we'll try to fix it.
==20061== Either way, Valgrind will now raise a SIGILL signal which will
==20061== probably kill your program.
==20061== 
==20061== Process terminating with default action of signal 4 (SIGILL)
==20061==  Illegal opcode at address 0x10913E
==20061==    at 0x10913E: main (in ...)
==20061==

Note: this answer has been tested with:

#include <immintrin.h>
int main(int argc, char *argv[]) {
    __m512d a, b, c;
    _mm512_fnmadd_pd(a, b, c);
}
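
If the Valgrind output is captured to a file (for example with `2> valgrind.log`), the offending address and instruction bytes can be extracted automatically. The sketch below is based only on the message format shown above; the function names are hypothetical. The 512-bit check assumes the standard EVEX layout (first byte 0x62, vector length in bits 6:5 of the fourth byte), and it can misreport register-only forms that use embedded rounding, where those bits have a different meaning.

```python
import re

BYTES = re.compile(r"unhandled instruction bytes:((?:\s+0x[0-9A-Fa-f]+)+)")
ADDR = re.compile(r"Unrecognised instruction at address (0x[0-9A-Fa-f]+)")

def unhandled_instructions(log):
    """Return (address, raw_bytes) pairs found in Valgrind's stderr text."""
    found = []
    pending = b""
    for line in log.splitlines():
        match = BYTES.search(line)
        if match:
            # Remember the bytes until the matching address line appears.
            pending = bytes(int(tok, 16) for tok in match.group(1).split())
            continue
        match = ADDR.search(line)
        if match:
            found.append((int(match.group(1), 16), pending))
            pending = b""
    return found

def is_evex_512(raw):
    """True if `raw` starts with an EVEX prefix whose length field says 512-bit."""
    # Assumption: 64-bit code, where a leading 0x62 byte is always EVEX.
    return len(raw) >= 4 and raw[0] == 0x62 and (raw[3] >> 5) & 0b11 == 0b10

if __name__ == "__main__":
    log = (
        "vex amd64->IR: unhandled instruction bytes: "
        "0x62 0xF1 0xFD 0x48 0x28 0x84 0x24 0x8 0x1 0x0\n"
        "==20061== valgrind: Unrecognised instruction at address 0x10913e.\n"
    )
    for address, raw in unhandled_instructions(log):
        print(hex(address), raw.hex(" "), is_evex_512(raw))
```

Note that Valgrind stops at the first such instruction, so this reports one location per run.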

Does `libvex` virtualize CPUID to not report AVX512 support? I think the OP would need a virtual machine that *did* report AVX512 support, so libraries would still feel free to use AVX512 (and leave it in a polluted state). — Peter Cordes, Sep 12 '18 at 16:02
@Peter - yes, libvex reports no support for AVX-512 via CPUID. — BeeOnRope, Sep 13 '18 at 06:59
EDIT: once you have the list of AVX-512 instruction addresses, you can place a breakpoint on each of them. I updated the answer with this idea. — Jérôme Pouiller, Sep 17 '18 at 14:19
