I'm trying to make sure gcc vectorizes my loops. It turns out, that by using -march=znver1
(or -march=native
) gcc skips some loops even though they can be vectorized. Why does this happen?
In this code, the second loop, which multiplies each element by a scalar is not vectorised:
#include <stdio.h>
#include <inttypes.h>
int main() {
const size_t N = 1000;
uint64_t arr[N];
for (size_t i = 0; i < N; ++i)
arr[i] = 1;
for (size_t i = 0; i < N; ++i)
arr[i] *= 5;
for (size_t i = 0; i < N; ++i)
printf("%lu\n", arr[i]); // use the array so that it is not optimized away
}
gcc -O3 -fopt-info-vec-all -mavx2 main.c
:
main.cpp:13:26: missed: couldn't vectorize loop
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:10:26: optimized: loop vectorized using 32 byte vectors
main.cpp:7:26: optimized: loop vectorized using 32 byte vectors
main.cpp:4:5: note: vectorized 2 loops in function.
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:15:1: note: ***** Analysis failed with vector mode V4DI
main.cpp:15:1: note: ***** Skipping vector mode V32QI, which would repeat the analysis for V4DI
gcc -O3 -fopt-info-vec-all -march=znver1 main.c
:
main.cpp:13:26: missed: couldn't vectorize loop
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:10:26: missed: couldn't vectorize loop
main.cpp:10:26: missed: not vectorized: unsupported data-type
main.cpp:7:26: optimized: loop vectorized using 16 byte vectors
main.cpp:4:5: note: vectorized 1 loops in function.
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:15:1: note: ***** Analysis failed with vector mode V2DI
main.cpp:15:1: note: ***** Skipping vector mode V16QI, which would repeat the analysis for V2DI
-march=znver1
includes -mavx2
, so I think gcc chooses not to vectorise it for some reason:
~ $ gcc -march=znver1 -Q --help=target
The following options are target specific:
-m128bit-long-double [enabled]
-m16 [disabled]
-m32 [disabled]
-m3dnow [disabled]
-m3dnowa [disabled]
-m64 [enabled]
-m80387 [enabled]
-m8bit-idiv [disabled]
-m96bit-long-double [disabled]
-mabi= sysv
-mabm [enabled]
-maccumulate-outgoing-args [disabled]
-maddress-mode= long
-madx [enabled]
-maes [enabled]
-malign-data= compat
-malign-double [disabled]
-malign-functions= 0
-malign-jumps= 0
-malign-loops= 0
-malign-stringops [enabled]
-mamx-bf16 [disabled]
-mamx-int8 [disabled]
-mamx-tile [disabled]
-mandroid [disabled]
-march= znver1
-masm= att
-mavx [enabled]
-mavx2 [enabled]
-mavx256-split-unaligned-load [disabled]
-mavx256-split-unaligned-store [enabled]
-mavx5124fmaps [disabled]
-mavx5124vnniw [disabled]
-mavx512bf16 [disabled]
-mavx512bitalg [disabled]
-mavx512bw [disabled]
-mavx512cd [disabled]
-mavx512dq [disabled]
-mavx512er [disabled]
-mavx512f [disabled]
-mavx512ifma [disabled]
-mavx512pf [disabled]
-mavx512vbmi [disabled]
-mavx512vbmi2 [disabled]
-mavx512vl [disabled]
-mavx512vnni [disabled]
-mavx512vp2intersect [disabled]
-mavx512vpopcntdq [disabled]
-mavxvnni [disabled]
-mbionic [disabled]
-mbmi [enabled]
-mbmi2 [enabled]
-mbranch-cost=<0,5> 3
-mcall-ms2sysv-xlogues [disabled]
-mcet-switch [disabled]
-mcld [disabled]
-mcldemote [disabled]
-mclflushopt [enabled]
-mclwb [disabled]
-mclzero [enabled]
-mcmodel= [default]
-mcpu=
-mcrc32 [disabled]
-mcx16 [enabled]
-mdispatch-scheduler [disabled]
-mdump-tune-features [disabled]
-menqcmd [disabled]
-mf16c [enabled]
-mfancy-math-387 [enabled]
-mfentry [disabled]
-mfentry-name=
-mfentry-section=
-mfma [enabled]
-mfma4 [disabled]
-mforce-drap [disabled]
-mforce-indirect-call [disabled]
-mfp-ret-in-387 [enabled]
-mfpmath= sse
-mfsgsbase [enabled]
-mfunction-return= keep
-mfused-madd -ffp-contract=fast
-mfxsr [enabled]
-mgeneral-regs-only [disabled]
-mgfni [disabled]
-mglibc [enabled]
-mhard-float [enabled]
-mhle [disabled]
-mhreset [disabled]
-miamcu [disabled]
-mieee-fp [enabled]
-mincoming-stack-boundary= 0
-mindirect-branch-register [disabled]
-mindirect-branch= keep
-minline-all-stringops [disabled]
-minline-stringops-dynamically [disabled]
-minstrument-return= none
-mintel-syntax -masm=intel
-mkl [disabled]
-mlarge-data-threshold=<number> 65536
-mlong-double-128 [disabled]
-mlong-double-64 [disabled]
-mlong-double-80 [enabled]
-mlwp [disabled]
-mlzcnt [enabled]
-mmanual-endbr [disabled]
-mmemcpy-strategy=
-mmemset-strategy=
-mmitigate-rop [disabled]
-mmmx [enabled]
-mmovbe [enabled]
-mmovdir64b [disabled]
-mmovdiri [disabled]
-mmpx [disabled]
-mms-bitfields [disabled]
-mmusl [disabled]
-mmwaitx [enabled]
-mneeded [disabled]
-mno-align-stringops [disabled]
-mno-default [disabled]
-mno-fancy-math-387 [disabled]
-mno-push-args [disabled]
-mno-red-zone [disabled]
-mno-sse4 [disabled]
-mnop-mcount [disabled]
-momit-leaf-frame-pointer [disabled]
-mpc32 [disabled]
-mpc64 [disabled]
-mpc80 [disabled]
-mpclmul [enabled]
-mpcommit [disabled]
-mpconfig [disabled]
-mpku [disabled]
-mpopcnt [enabled]
-mprefer-avx128 -mprefer-vector-width=128
-mprefer-vector-width= 128
-mpreferred-stack-boundary= 0
-mprefetchwt1 [disabled]
-mprfchw [enabled]
-mptwrite [disabled]
-mpush-args [enabled]
-mrdpid [disabled]
-mrdrnd [enabled]
-mrdseed [enabled]
-mrecip [disabled]
-mrecip=
-mrecord-mcount [disabled]
-mrecord-return [disabled]
-mred-zone [enabled]
-mregparm= 6
-mrtd [disabled]
-mrtm [disabled]
-msahf [enabled]
-mserialize [disabled]
-msgx [disabled]
-msha [enabled]
-mshstk [disabled]
-mskip-rax-setup [disabled]
-msoft-float [disabled]
-msse [enabled]
-msse2 [enabled]
-msse2avx [disabled]
-msse3 [enabled]
-msse4 [enabled]
-msse4.1 [enabled]
-msse4.2 [enabled]
-msse4a [enabled]
-msse5 -mavx
-msseregparm [disabled]
-mssse3 [enabled]
-mstack-arg-probe [disabled]
-mstack-protector-guard-offset=
-mstack-protector-guard-reg=
-mstack-protector-guard-symbol=
-mstack-protector-guard= tls
-mstackrealign [disabled]
-mstringop-strategy= [default]
-mstv [enabled]
-mtbm [disabled]
-mtls-dialect= gnu
-mtls-direct-seg-refs [enabled]
-mtsxldtrk [disabled]
-mtune-ctrl=
-mtune= znver1
-muclibc [disabled]
-muintr [disabled]
-mvaes [disabled]
-mveclibabi= [default]
-mvect8-ret-in-mem [disabled]
-mvpclmulqdq [disabled]
-mvzeroupper [enabled]
-mwaitpkg [disabled]
-mwbnoinvd [disabled]
-mwidekl [disabled]
-mx32 [disabled]
-mxop [disabled]
-mxsave [enabled]
-mxsavec [enabled]
-mxsaveopt [enabled]
-mxsaves [enabled]
Known assembler dialects (for use with the -masm= option):
att intel
Known ABIs (for use with the -mabi= option):
ms sysv
Known code models (for use with the -mcmodel= option):
32 kernel large medium small
Valid arguments to -mfpmath=:
387 387+sse 387,sse both sse sse+387 sse,387
Known indirect branch choices (for use with the -mindirect-branch=/-mfunction-return= options):
keep thunk thunk-extern thunk-inline
Known choices for return instrumentation with -minstrument-return=:
call none nop5
Known data alignment choices (for use with the -malign-data= option):
abi cacheline compat
Known vectorization library ABIs (for use with the -mveclibabi= option):
acml svml
Known address mode (for use with the -maddress-mode= option):
long short
Known preferred register vector length (to use with the -mprefer-vector-width= option):
128 256 512 none
Known stack protector guard (for use with the -mstack-protector-guard= option):
global tls
Valid arguments to -mstringop-strategy=:
byte_loop libcall loop rep_4byte rep_8byte rep_byte unrolled_loop vector_loop
Known TLS dialects (for use with the -mtls-dialect= option):
gnu gnu2
Known valid arguments for -march= option:
i386 i486 i586 pentium lakemont pentium-mmx winchip-c6 winchip2 c3 samuel-2 c3-2 nehemiah c7 esther i686 pentiumpro pentium2 pentium3 pentium3m pentium-m pentium4 pentium4m prescott nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 cannonlake icelake-client rocketlake icelake-server cascadelake tigerlake cooperlake sapphirerapids alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm intel geode k6 k6-2 k6-3 athlon athlon-tbird athlon-4 athlon-xp athlon-mp x86-64 x86-64-v2 x86-64-v3 x86-64-v4 eden-x2 nano nano-1000 nano-2000 nano-3000 nano-x2 eden-x4 nano-x4 k8 k8-sse3 opteron opteron-sse3 athlon64 athlon64-sse3 athlon-fx amdfam10 barcelona bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 btver1 btver2 generic native
Known valid arguments for -mtune= option:
generic i386 i486 pentium lakemont pentiumpro pentium4 nocona core2 nehalem sandybridge haswell bonnell silvermont goldmont goldmont-plus tremont knl knm skylake skylake-avx512 cannonlake icelake-client icelake-server cascadelake tigerlake cooperlake sapphirerapids alderlake rocketlake intel geode k6 athlon k8 amdfam10 bdver1 bdver2 bdver3 bdver4 btver1 btver2 znver1 znver2 znver3
I also tried clang and in both cases the loops are vectorised by, I believe, 32 byte vectors:
remark: vectorized loop (vectorization width: 4, interleaved count: 4)
I'm using gcc 11.2.0
Edit: As requested by Peter Cordes I realised I was actually benchmarking with a multiplication by 4 for some time.
Makefile:
all:
gcc -O3 -mavx2 main.c -o 3
gcc -O3 -march=znver2 main.c -o 32
gcc -O3 -march=znver2 main.c -mprefer-vector-width=128 -o 32128
gcc -O3 -march=znver1 main.c -o 31
gcc -O2 -mavx2 main.c -o 2
gcc -O2 -march=znver2 main.c -o 22
gcc -O2 -march=znver2 main.c -mprefer-vector-width=128 -o 22128
gcc -O2 -march=znver1 main.c -o 21
hyperfine -r5 ./3 ./32 ./32128 ./31 ./2 ./22 ./22128 ./21
clean:
rm ./3 ./32 ./32128 ./31 ./2 ./22 ./22128 ./21
Code:
#include <stdio.h>
#include <inttypes.h>
#include <stdlib.h>
#include <time.h>
int main() {
const size_t N = 500;
uint64_t arr[N];
for (size_t i = 0; i < N; ++i)
arr[i] = 1;
for (int j = 0; j < 20000000; ++j)
for (size_t i = 0; i < N; ++i)
arr[i] *= 4;
srand(time(0));
printf("%lu\n", arr[rand() % N]); // use the array so that it is not optimized away
}
N = 500, arr[i] *= 4
:
Benchmark 1: ./3
Time (mean ± σ): 1.780 s ± 0.011 s [User: 1.778 s, System: 0.000 s]
Range (min … max): 1.763 s … 1.791 s 5 runs
Benchmark 2: ./32
Time (mean ± σ): 1.785 s ± 0.016 s [User: 1.783 s, System: 0.000 s]
Range (min … max): 1.773 s … 1.810 s 5 runs
Benchmark 3: ./32128
Time (mean ± σ): 1.740 s ± 0.026 s [User: 1.737 s, System: 0.000 s]
Range (min … max): 1.724 s … 1.785 s 5 runs
Benchmark 4: ./31
Time (mean ± σ): 1.757 s ± 0.022 s [User: 1.754 s, System: 0.000 s]
Range (min … max): 1.727 s … 1.785 s 5 runs
Benchmark 5: ./2
Time (mean ± σ): 3.467 s ± 0.031 s [User: 3.462 s, System: 0.000 s]
Range (min … max): 3.443 s … 3.519 s 5 runs
Benchmark 6: ./22
Time (mean ± σ): 3.475 s ± 0.028 s [User: 3.469 s, System: 0.001 s]
Range (min … max): 3.447 s … 3.512 s 5 runs
Benchmark 7: ./22128
Time (mean ± σ): 3.464 s ± 0.034 s [User: 3.459 s, System: 0.001 s]
Range (min … max): 3.431 s … 3.509 s 5 runs
Benchmark 8: ./21
Time (mean ± σ): 3.465 s ± 0.013 s [User: 3.460 s, System: 0.001 s]
Range (min … max): 3.443 s … 3.475 s 5 runs
N = 500, arr[i] *= 5
:
Benchmark 1: ./3
Time (mean ± σ): 1.789 s ± 0.004 s [User: 1.786 s, System: 0.001 s]
Range (min … max): 1.783 s … 1.793 s 5 runs
Benchmark 2: ./32
Time (mean ± σ): 1.772 s ± 0.017 s [User: 1.769 s, System: 0.000 s]
Range (min … max): 1.755 s … 1.800 s 5 runs
Benchmark 3: ./32128
Time (mean ± σ): 2.911 s ± 0.023 s [User: 2.907 s, System: 0.001 s]
Range (min … max): 2.880 s … 2.943 s 5 runs
Benchmark 4: ./31
Time (mean ± σ): 2.924 s ± 0.013 s [User: 2.921 s, System: 0.000 s]
Range (min … max): 2.906 s … 2.934 s 5 runs
Benchmark 5: ./2
Time (mean ± σ): 3.850 s ± 0.029 s [User: 3.846 s, System: 0.000 s]
Range (min … max): 3.823 s … 3.896 s 5 runs
Benchmark 6: ./22
Time (mean ± σ): 3.816 s ± 0.036 s [User: 3.812 s, System: 0.000 s]
Range (min … max): 3.777 s … 3.855 s 5 runs
Benchmark 7: ./22128
Time (mean ± σ): 3.813 s ± 0.026 s [User: 3.809 s, System: 0.000 s]
Range (min … max): 3.780 s … 3.834 s 5 runs
Benchmark 8: ./21
Time (mean ± σ): 3.783 s ± 0.010 s [User: 3.779 s, System: 0.000 s]
Range (min … max): 3.773 s … 3.798 s 5 runs
N = 512, arr[i] *= 4
Benchmark 1: ./3
Time (mean ± σ): 1.849 s ± 0.015 s [User: 1.847 s, System: 0.000 s]
Range (min … max): 1.831 s … 1.873 s 5 runs
Benchmark 2: ./32
Time (mean ± σ): 1.846 s ± 0.013 s [User: 1.844 s, System: 0.001 s]
Range (min … max): 1.832 s … 1.860 s 5 runs
Benchmark 3: ./32128
Time (mean ± σ): 1.756 s ± 0.012 s [User: 1.754 s, System: 0.000 s]
Range (min … max): 1.744 s … 1.771 s 5 runs
Benchmark 4: ./31
Time (mean ± σ): 1.788 s ± 0.012 s [User: 1.785 s, System: 0.001 s]
Range (min … max): 1.774 s … 1.801 s 5 runs
Benchmark 5: ./2
Time (mean ± σ): 3.476 s ± 0.015 s [User: 3.472 s, System: 0.001 s]
Range (min … max): 3.458 s … 3.494 s 5 runs
Benchmark 6: ./22
Time (mean ± σ): 3.449 s ± 0.002 s [User: 3.446 s, System: 0.000 s]
Range (min … max): 3.446 s … 3.452 s 5 runs
Benchmark 7: ./22128
Time (mean ± σ): 3.456 s ± 0.007 s [User: 3.453 s, System: 0.000 s]
Range (min … max): 3.446 s … 3.462 s 5 runs
Benchmark 8: ./21
Time (mean ± σ): 3.547 s ± 0.044 s [User: 3.542 s, System: 0.001 s]
Range (min … max): 3.482 s … 3.600 s 5 runs
N = 512, arr[i] *= 5
Benchmark 1: ./3
Time (mean ± σ): 1.847 s ± 0.013 s [User: 1.845 s, System: 0.000 s]
Range (min … max): 1.836 s … 1.863 s 5 runs
Benchmark 2: ./32
Time (mean ± σ): 1.830 s ± 0.007 s [User: 1.827 s, System: 0.001 s]
Range (min … max): 1.820 s … 1.837 s 5 runs
Benchmark 3: ./32128
Time (mean ± σ): 2.983 s ± 0.017 s [User: 2.980 s, System: 0.000 s]
Range (min … max): 2.966 s … 3.012 s 5 runs
Benchmark 4: ./31
Time (mean ± σ): 3.026 s ± 0.039 s [User: 3.021 s, System: 0.001 s]
Range (min … max): 2.989 s … 3.089 s 5 runs
Benchmark 5: ./2
Time (mean ± σ): 4.000 s ± 0.021 s [User: 3.994 s, System: 0.001 s]
Range (min … max): 3.982 s … 4.035 s 5 runs
Benchmark 6: ./22
Time (mean ± σ): 3.940 s ± 0.041 s [User: 3.934 s, System: 0.001 s]
Range (min … max): 3.890 s … 3.981 s 5 runs
Benchmark 7: ./22128
Time (mean ± σ): 3.928 s ± 0.032 s [User: 3.922 s, System: 0.001 s]
Range (min … max): 3.898 s … 3.979 s 5 runs
Benchmark 8: ./21
Time (mean ± σ): 3.908 s ± 0.029 s [User: 3.904 s, System: 0.000 s]
Range (min … max): 3.879 s … 3.954 s 5 runs
I think the run where -O2 -march=znver1
was the same speed as -O3 -march=znver1
was a mistake on my part with the naming of the files, I had not created the makefile back then yet, I was using my shell's history.