I currently run BOINC across a number of servers which have GPUs.

The servers run both GPU and CPU BOINC apps.

As AVX and SSE slow down the CPU frequency when used within a CPU app, I have to be selective about which CPU and GPU apps I run together: some GPU apps get bottlenecked (slower run time to completion), whereas others do not.

At present, some CPU apps are named so that it is clear they use AVX, but most are not.

Is there a command I can run, or some way of viewing, whether any of the currently running CPU apps are using AVX or SSE (any version)?

Also, as a side note, should I treat any FMA usage in the same way (e.g. does it also slow down the CPU frequency due to increased CPU temperatures)?

Thanks

Chris
  • *As AVX and SSE slow down the CPU freq* Not true for SSE, except for thermal / power limits. The occasional SSE instruction won't ever hurt (unlike the occasional 256-bit AVX instruction on some CPUs like Haswell). And if you have heavy use of SSE, the clock speed penalty is probably less bad than running twice as many scalar instructions. And yes, you should just treat FMA as any other SIMD FP instruction, the same as you would `vmulps` or `vaddps`. 128-bit AVX instructions are fine, though; you could safely compile with `gcc -O3 -march=native -mprefer-vector-width=128` – Peter Cordes Feb 20 '20 at 22:27
  • There are perf counters for SIMD FP math; those are the main things that require reducing the max turbo. See [How do I monitor the amount of SIMD instruction usage](//stackoverflow.com/q/60104698); this is maybe a duplicate. – Peter Cordes Feb 20 '20 at 22:30
  • Please stay on point and don't answer your own question. The CPU apps run at 100% load for up to 30 days continuously (such as CPDN). The ones that I know use AVX, such as Rosetta@home (protein folding), do slow down the frequency straight away (well before there is time to hit any thermal constraints). But this question is about being able to see whether a given CPU app is using either of these instruction sets, not about compiling apps or your opinion on whether an app will decrease frequency due to temperatures. – Chris Feb 20 '20 at 22:32
  • Yes, like I said, 256-bit AVX instructions can reduce turbo frequency right away. (So can peak power/current delivery limits, even before temperature limits require reductions.) But even so, you probably only need to worry about 256-bit AVX, not SSE. – Peter Cordes Feb 21 '20 at 00:42

1 Answer

You can use `perf top` to watch, in real time, how many packed SSE and AVX floating-point instructions are being executed, along with the names of the executables and shared libraries executing them:

perf top -e fp_arith_inst_retired.128b_packed_single -e fp_arith_inst_retired.128b_packed_double -e fp_arith_inst_retired.256b_packed_single -e fp_arith_inst_retired.256b_packed_double
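To check one specific BOINC process rather than the whole machine, the same events can be passed to `perf stat` with a process ID. The following is a minimal sketch, not a definitive tool: it assumes Linux `perf` is installed, that the CPU exposes the `fp_arith_inst_retired.*` events shown here, and that you have permission to profile the target process. The function names `summarize_fp_counts` and `fp_check_pid` are hypothetical helpers introduced for illustration.

```shell
#!/usr/bin/env bash
# Sketch: check one running process for packed SSE/AVX floating-point activity.
# Assumes Linux perf and a CPU that exposes the fp_arith_inst_retired.* events.

FP_EVENTS="fp_arith_inst_retired.128b_packed_single,fp_arith_inst_retired.128b_packed_double,fp_arith_inst_retired.256b_packed_single,fp_arith_inst_retired.256b_packed_double"

# Read `perf stat -x,` CSV lines on stdin (value,unit,event,...) and print
# one "<event> <count>" line per floating-point event (hypothetical helper).
summarize_fp_counts() {
    awk -F, '$3 ~ /^fp_arith_inst_retired\./ {
        # Non-numeric values such as "<not counted>" are reported as 0.
        count = ($1 ~ /^[0-9]+$/) ? $1 : 0
        # Strip event modifiers such as ":u" from the event name.
        sub(/:.*$/, "", $3)
        print $3, count
    }'
}

# Count the events for process $1 over $2 seconds (default 10).
fp_check_pid() {
    local pid="$1" seconds="${2:-10}"
    # perf stat writes its results to stderr, so route stderr into the pipe
    # and discard stdout; -x, selects comma-separated output.
    perf stat -x, -e "$FP_EVENTS" -p "$pid" -- sleep "$seconds" 2>&1 >/dev/null |
        summarize_fp_counts
}

# Example (12345 is a placeholder PID):
#   fp_check_pid 12345 30
```

A non-zero count on either `256b_packed_*` line suggests the process executed 256-bit AVX floating-point arithmetic during the sampling window; the `128b_packed_*` events cover both SSE and 128-bit AVX instructions, so they cannot tell those two apart.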

Counter descriptions (from `perf list` output on an Intel Coffee Lake CPU):

floating point:
  fp_arith_inst_retired.128b_packed_double          
       [Number of SSE/AVX computational 128-bit packed double precision floating-point instructions retired. Each count represents 2 computations. Applies to SSE* and AVX*
        packed double precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform
        multiple calculations per element]
  fp_arith_inst_retired.128b_packed_single          
       [Number of SSE/AVX computational 128-bit packed single precision floating-point instructions retired. Each count represents 4 computations. Applies to SSE* and AVX*
        packed single precision floating-point instructions: ADD SUB MUL DIV MIN MAX RCP RSQRT SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they
        perform multiple calculations per element]
  fp_arith_inst_retired.256b_packed_double          
       [Number of SSE/AVX computational 256-bit packed double precision floating-point instructions retired. Each count represents 4 computations. Applies to SSE* and AVX*
        packed double precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform
        multiple calculations per element]
  fp_arith_inst_retired.256b_packed_single          
       [Number of SSE/AVX computational 256-bit packed single precision floating-point instructions retired. Each count represents 8 computations. Applies to SSE* and AVX*
        packed single precision floating-point instructions: ADD SUB MUL DIV MIN MAX RCP RSQRT SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they
        perform multiple calculations per element]
  fp_arith_inst_retired.scalar_double               
       [Number of SSE/AVX computational scalar double precision floating-point instructions retired. Each count represents 1 computation. Applies to SSE* and AVX* scalar double
        precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform multiple calculations per element]
  fp_arith_inst_retired.scalar_single               
       [Number of SSE/AVX computational scalar single precision floating-point instructions retired. Each count represents 1 computation. Applies to SSE* and AVX* scalar single
        precision floating-point instructions: ADD SUB MUL DIV MIN MAX RCP RSQRT SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform multiple calculations
        per element]
  fp_assist.any                                     
       [Cycles with any input/output SSE or FP assist]
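The descriptions above say how many computations each count represents (1 for scalar, 2 or 4 for 128-bit packed, 4 or 8 for 256-bit packed). As a rough sketch of how raw counts could be turned into a share of work done at each vector width, the hypothetical helper below reads `<event> <count>` lines on stdin and applies those weights. The input format and the function name `fp_width_share` are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Sketch: weight each counter by the number of computations per count,
# as stated in the perf list descriptions, and report totals per width.

fp_width_share() {
    awk '
    BEGIN {
        # Computations represented by one count of each event.
        w["fp_arith_inst_retired.scalar_single"]      = 1
        w["fp_arith_inst_retired.scalar_double"]      = 1
        w["fp_arith_inst_retired.128b_packed_double"] = 2
        w["fp_arith_inst_retired.128b_packed_single"] = 4
        w["fp_arith_inst_retired.256b_packed_double"] = 4
        w["fp_arith_inst_retired.256b_packed_single"] = 8
    }
    ($1 in w) {
        ops = $2 * w[$1]
        if ($1 ~ /256b/)      wide   += ops
        else if ($1 ~ /128b/) narrow += ops
        else                  scalar += ops
        total += ops
    }
    END {
        if (total == 0) { print "no FP arithmetic counted"; exit }
        printf "scalar=%.0f 128b=%.0f 256b=%.0f 256b_share=%.1f%%\n",
               scalar, narrow, wide, 100 * wide / total
    }'
}

# Example:
#   printf "%s\n" "fp_arith_inst_retired.256b_packed_double 100" \
#                 "fp_arith_inst_retired.scalar_double 400" | fp_width_share
```

The share is only an indication of how much of the floating-point work is 256-bit; it says nothing about integer SIMD instructions, which these events do not count.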
Maxim Egorushkin