
I want to evaluate an AVX2 program written with C intrinsics, compiling it with gcc 5.4.0 and clang 3.8 and analyzing it with perf, valgrind, and IACA. I want exactly the same optimization approach in both compilers, so I read this related question on clang optimization and this page on gcc's optimization options, but I am still in doubt.

gcc -O2 and gcc -O3 are my baseline, and I want the equivalent in clang. Clang auto-vectorizes at -O2, which I don't want when comparing against gcc -O2, but I do want it when comparing against gcc -O3. So the question is: which clang commands correspond to these gcc commands?

First:

compile : gcc -Wall -O2 -march=native -masm=intel -c -S "%f"

build: gcc -Wall -O2 -mavx2 -o "%e" "%f"

Second:

compile : gcc -Wall -O3 -march=native -masm=intel -c -S "%f"

build: gcc -Wall -O3 -mavx2 -o "%e" "%f"

Amiri
  • If you're looking at compiler output for manually-vectorized code, the compiler usually can't auto-vectorize on top of that. But sometimes compilers are dumb and will try to auto-vectorize your scalar cleanup loop or something. So I'd suggest using `-O3 -fno-tree-vectorize` for both gcc and clang to disable auto-vectorization if that's what you want. – Peter Cordes Nov 27 '16 at 02:19
  • Thank you @PeterCordes. Should I use the `-mllvm` option in clang? – Amiri Nov 27 '16 at 09:52
  • IDK, I've never used `-mllvm`. What does it do, and why do you think you maybe should use it? – Peter Cordes Nov 27 '16 at 10:16
  • @PeterCordes, the program adds two matrices, so does this make any sense? I think we missed something. gcc: `gcc -Wall -mavx2 -O3 -fno-tree-vectorize -o "%e" "%f"` The best time is 0.000100 sec in 10001 repetitions for a 448X448 matrix. clang: `clang "%f" -o "%e" -mavx2 -O3 -fno-tree-vectorize` The best time is 0.000067 sec in 10001 repetitions for a 448X448 matrix. – Amiri Nov 27 '16 at 11:32
  • I was reading an article that said "using clang with llvm" (maybe it means the back-end and front-end, not using this option on the command line). – Amiri Nov 27 '16 at 11:35
  • Those times are pretty short. Maybe put it in a repeat-loop? IDK, look at the asm and / or perf counters and see if clang made a more efficient loop. Also, if you're testing on a Haswell CPU, you should use `-march=haswell` to enable `-mtune=haswell` as well as `-mavx2`. gcc's "generic" tuning still fails to optimize for things like macro-fusion on Intel and AMD CPUs. – Peter Cordes Nov 27 '16 at 11:35
  • I'm using a `skylake` CPU, I used `-march=native`. There is no change. – Amiri Nov 27 '16 at 11:37
  • http://stackoverflow.com/questions/29946629/how-to-change-llvmpass-long-opt-command-to-a-simple-command says `-mllvm` is not useful on its own. It's just for passing other low-level args to the llvm backend that clang doesn't handle natively. I don't know if clang supports any backends other than LLVM. – Peter Cordes Nov 27 '16 at 11:37
  • Ok, `-march=native` is good. You left out the `-march=native` part for the command line options you said you benchmarked. – Peter Cordes Nov 27 '16 at 11:39
  • OK Peter, when I use `perf` on the clang-built file it records, but it throws an error when reporting. For the gcc-built file, the report for CPU cycles is `1.17 │ 48: vmovss 0x6c5080(%rdx,%rax,1),%xmm1 0.61 │ vaddss 0x601080(%rdx,%rax,1),%xmm1,%xmm1 40.94 │ vmovss %xmm1,0x789080(%rdx,%rax,1) 36.37 │ add $0x4,%rax` What am I missing to record data for the clang-built file? – Amiri Nov 27 '16 at 11:43
  • I thought you said you had manually vectorized the code with intrinsics. Those are scalar instructions. Anyway, I have no idea what you're doing wrong with `perf record` / `perf report`. – Peter Cordes Nov 27 '16 at 11:52
  • I work with vectorized code, but now I'm testing gcc and clang in scalar mode to see the performance. – Amiri Nov 27 '16 at 11:54
  • I use `sudo perf -e "my needs" ./myBinaryfile` and then `sudo perf report` to annotate the report. It works for the `gcc`-built file. – Amiri Nov 27 '16 at 18:56
  • `-Os` is the same as `-O2`. `-Oz` is based on `-Os`: opt drops `-slp-vectorizer`, clang drops `-vectorize-loops`. – Amiri Dec 14 '16 at 06:09
  • `-Os` also sets tuning options to favour small code-size. It does things like using DIV instead of modular multiplicative inverses for division by constants. – Peter Cordes Dec 14 '16 at 06:41
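
For reference, the scalar benchmark discussed in the comments could look roughly like the sketch below. This is only an illustration, not the asker's actual program: the names `mat_a`, `mat_b`, `mat_c`, `fill_inputs`, `add_matrices`, `now_sec` and `best_add_time` are hypothetical, and the timing method (`clock_gettime` with `CLOCK_MONOTONIC`, keeping the shortest run) is an assumption. Only the 448X448 size comes from the thread.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime */
#include <assert.h>
#include <time.h>

#define N 448  /* matrix dimension quoted in the comments */

/* External linkage, so the compiler has to keep the stores to mat_c. */
float mat_a[N][N], mat_b[N][N], mat_c[N][N];

/* Fill the inputs with small integers that are exactly representable in float. */
void fill_inputs(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            mat_a[i][j] = (float)(i + j);
            mat_b[i][j] = (float)(i - j);
        }
}

/* Scalar element-wise addition: mat_c = mat_a + mat_b. */
void add_matrices(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            mat_c[i][j] = mat_a[i][j] + mat_b[i][j];
}

/* Monotonic wall-clock time in seconds. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run the kernel `reps` times and return the shortest single run in seconds. */
double best_add_time(int reps)
{
    assert(reps > 0);
    double best = 0.0;
    for (int r = 0; r < reps; r++) {
        double t0 = now_sec();
        add_matrices();
        double dt = now_sec() - t0;
        if (r == 0 || dt < best)
            best = dt;
    }
    return best;
}
```

Calling `fill_inputs()` and then printing `best_add_time(10001)` from a `main` gives the kind of "best time" figure quoted above. Building it once with `gcc -O3 -fno-tree-vectorize -march=native` and once with `clang -O3 -fno-tree-vectorize -march=native`, as suggested in the comments, should keep the inner loop scalar in both compilers, which is worth confirming in the `-S` output. Clang's SLP vectorizer is a separate pass controlled by `-fno-slp-vectorize`, so that flag may also be needed for a like-for-like comparison.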

0 Answers