
I use the following two compile commands to build my Gaussian blur program:

  1. g++ -Ofast -ffast-math -march=native -flto -fwhole-program -std=c++11 -fopenmp -o interpolateFloatImg interpolateFloatImg.cpp

  2. g++ -O3 -std=c++11 -fopenmp -o interpolateFloatImg interpolateFloatImg.cpp

My two testing environments are:

  • i7 4710HQ 4 cores 8 threads
  • E5 2650

However, the first binary runs about 2x faster on the E5 but only half as fast on the i7. The second binary is faster on the i7 but slower on the E5.

Can anyone explain this?

This is the source code: https://github.com/makeapp007/interpolateFloatImg

I will give out more details as soon as possible.

On the i7 the program runs on 8 threads. I don't know how many threads it will spawn on the E5.
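To make the comparison fair, the OpenMP thread count can be pinned to the same value on both machines before running (the environment-variable approach suggested in the comments below; the value 8 is illustrative):

```shell
# Force the same OpenMP thread count on both machines (value is illustrative)
export OMP_NUM_THREADS=8
# Keep threads pinned to cores so they are not migrated between cores mid-run
export OMP_PROC_BIND=true
# Then run the program as usual:
#   ./interpolateFloatImg lobby.bin out.bin 255 20
echo "OpenMP will use $OMP_NUM_THREADS threads"
```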

==== Update ====

I am the teammate of the original author on this project, and here are the results.

Arch-Lenovo-Y50 ~/project/ca/3/12 (git)-[master] % perf stat -d ./interpolateFloatImg lobby.bin out.bin 255 20
Kernel kernelSize  : 255
Standard deviation : 20
Kernel maximum: 0.000397887
Kernel minimum: 1.22439e-21
Reading width 20265 height  8533 = 172921245
Micro seconds: 211199093
Performance counter stats for './interpolateFloatImg lobby.bin out.bin 255 20':
1423026.281358      task-clock:u (msec)       #    6.516 CPUs utilized          
             0      context-switches:u        #    0.000 K/sec                  
             0      cpu-migrations:u          #    0.000 K/sec                  
         2,604      page-faults:u             #    0.002 K/sec                  
4,167,572,543,807      cycles:u                  #    2.929 GHz                      (46.79%)
6,713,517,640,459      instructions:u            #    1.61  insn per cycle           (59.29%)
725,873,982,404      branches:u                #  510.092 M/sec                    (57.28%)
23,468,237,735      branch-misses:u           #    3.23% of all branches          (56.99%)
544,480,682,764      L1-dcache-loads:u         #  382.622 M/sec                    (37.00%)
545,000,783,842      L1-dcache-load-misses:u   #  100.10% of all L1-dcache hits    (31.44%)
38,696,703,292      LLC-loads:u               #   27.193 M/sec                    (26.68%)
1,204,703,652      LLC-load-misses:u         #    3.11% of all LL-cache hits     (35.70%)
218.384387536 seconds time elapsed

And these are the results from the workstation:

workstation:~/mossCAP3/repos/liuyh1_liujzh/12$  perf stat -d ./interpolateFloatImg ../../../lobby.bin out.bin 255 20
Kernel kernelSize  : 255
Standard deviation : 20
Kernel maximum: 0.000397887
Kernel minimum: 1.22439e-21
Reading width 20265 height  8533 = 172921245
Micro seconds: 133661220
Performance counter stats for './interpolateFloatImg ../../../lobby.bin out.bin 255 20':
2035379.528531      task-clock (msec)         #   14.485 CPUs utilized          
         7,370      context-switches          #    0.004 K/sec                  
           273      cpu-migrations            #    0.000 K/sec                  
         3,123      page-faults               #    0.002 K/sec                  
5,272,393,071,699      cycles                    #    2.590 GHz                     [49.99%]
             0      stalled-cycles-frontend   #    0.00% frontend cycles idle   
             0      stalled-cycles-backend    #    0.00% backend  cycles idle   
7,425,570,600,025      instructions              #    1.41  insns per cycle         [62.50%]
370,199,835,630      branches                  #  181.882 M/sec                   [62.50%]
47,444,417,555      branch-misses             #   12.82% of all branches         [62.50%]
591,137,049,749      L1-dcache-loads           #  290.431 M/sec                   [62.51%]
545,926,505,523      L1-dcache-load-misses     #   92.35% of all L1-dcache hits   [62.51%]
38,725,975,976      LLC-loads                 #   19.026 M/sec                   [50.00%]
 1,093,840,555      LLC-load-misses           #    2.82% of all LL-cache hits    [49.99%]
140.520016141 seconds time elapsed

==== Update ==== The specification of the E5:

workstation:~$ cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
     20  Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
workstation:~$ dmesg | grep cache
[    0.041489] Dentry cache hash table entries: 4194304 (order: 13, 33554432 bytes)
[    0.047512] Inode-cache hash table entries: 2097152 (order: 12, 16777216 bytes)
[    0.050088] Mount-cache hash table entries: 65536 (order: 7, 524288 bytes)
[    0.050121] Mountpoint-cache hash table entries: 65536 (order: 7, 524288 bytes)
[    0.558666] PCI: pci_cache_line_size set to 64 bytes
[    0.918203] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[    0.948808] xhci_hcd 0000:00:14.0: cache line size of 32 is not supported
[    1.076303] ehci-pci 0000:00:1a.0: cache line size of 32 is not supported
[    1.089022] ehci-pci 0000:00:1d.0: cache line size of 32 is not supported
[    1.549796] sd 4:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.552711] sd 5:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.552955] sd 6:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
makeapp
  • makeapp, can you post results of `perf stat ./interpolateFloatImg` and `perf stat -d ./interpolateFloatImg` of both programs on both platforms? These results will have real cpu frequency (line `cycles ... GHz`). What is your E5 model (there were different generations of xeon E5 cpus: v1, v2, v3, v4)? No one may answer your question without source code and detailed profiling results with disassembly of hot spot or without capability of reproducing the test on own machine (http://stackoverflow.com/help/mcve - there is no Minimal, Complete, and Verifiable example in your question). – osgx Jun 26 '16 at 16:23
  • makeapp, thank you for code. What is your OS? What is your gcc version (is it same for i7 and E5)? Can you give any image for running your code? What is kernel size and image size (args)? What about 4 outputs (program1 on system1, program1 on system2, program2 on system1, program2 on system2) from `perf stat` and 4 from `perf stat -d`? You have omp parallel for, did you try to limit number of threads to same value (`export OMP_NUM_THREADS=4`) and/or `export OMP_PROC_BIND=true`? – osgx Jun 27 '16 at 01:04
  • For i7, it is Arch Linux. For E5, it is Ubuntu 14.04. For E5, the g++ version is 4.8.2; for i7, it is 6.1.1. The kernel size is 277 with standard deviation 10, the precision is 0.002, and the testing input image is 1000*1000. Details are on http://shtech.org/course/ca/projects/3/. I didn't limit thread numbers. – makeapp Jun 27 '16 at 01:15
  • The kernel size is 277, the standard deviation is 10. I am not sure whether the E5 machine limits the threads; it is my teacher's computer. – makeapp Jun 27 '16 at 01:23
  • makeapp, thank you, will send link to this question to the teacher. What is your input image and running time? What about `perf stat`? – osgx Jun 27 '16 at 01:37
  • Your E5-26*v3 [is haswell](https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors#.22Haswell-EP.22_.2822_nm.29_Efficient_Performance), it [has same](http://ark.intel.com/products/81705/Intel-Xeon-Processor-E5-2650-v3-25M-Cache-2_30-GHz) AVX2 vector extensions as i7-4*. Do profile your program with `perf record`/`perf report` to find hot spot (check my updated answer). And try to rewrite reference program. – osgx Jun 27 '16 at 13:30

2 Answers


Based on the compiler flags you indicated, the first Makefile uses the -march=native flag, which partly explains why you are observing different performance gaps on the two CPUs with and without it.

This flag allows GCC to use instructions specific to a given CPU architecture, and that are not necessarily available on a different architecture. It also implies -mtune=native which tunes the compiled code for the specific CPU of the machine and favours instruction sequences that run faster on that CPU. Note that code compiled with -march=native may not work at all if run on a system with a different CPU, or be significantly slower.

So even though the options seem to be the same, they will act differently behind the scenes, depending on the machine you are using to compile. You can find more information about this flag in the GCC documentation.

To see what options are specifically enabled for each CPU, you can run the following command on each of your machines:

gcc -march=native -Q --help=target

In addition, different versions of GCC also have an influence on how different compiler flags will optimise your code, especially the -march=native flag which doesn't have as many tweaks enabled on older versions of GCC (newer architectures weren't necessarily fully supported at the time). This can further explain the gaps you are observing.

Pyves
  • Pyves, does any modern gcc version use march=native by default? makeapp, which gcc/g++ version do you have? – osgx Jun 27 '16 at 01:00
  • For E5, the g++ version is 4.8.2. For i7, the g++ version is 6.1.1 – makeapp Jun 27 '16 at 01:12
  • Pyves, there is the source; and it is not good. I was able to optimize it twofold (twice faster) at 277 10 on cropped.bin. – osgx Jun 27 '16 at 05:19
  • makeapp, it is optimizable by editing sources for 277 10 in 7.5 times from original. (still need info about exact model of his E5 xeon) – osgx Jun 27 '16 at 06:06
  • @osgx This option is not enabled by default on any version of GCC; this produces code that might not work at all on other architectures, so you must turn it on manually. – Pyves Jun 27 '16 at 08:36
  • @makeapp Try installing GCC 6.1.1 on the E5, as an alternative-update if need-be; -march=native doesn't have as many tweaks enabled on older versions of GCC (among other reasons because newer architectures weren't necessarily fully supported at the time). – Pyves Jun 27 '16 at 08:36
  • Sorry, I still don't know why using '-ffast-math -march=native -flto -fwhole-program' brings 2x speed on the E5 but 1/2x speed on the i7 (compared to the Makefile using -O3). Why wasn't the speed improved on both? – makeapp Jun 27 '16 at 12:03

Your program has a very high cache miss ratio. Is that good for the program or bad?

545,000,783,842 L1-dcache-load-misses:u # 100.10% of all L1-dcache hits

545,926,505,523 L1-dcache-load-misses # 92.35% of all L1-dcache hits

Cache sizes may differ between the i7 and the E5, so that is one source of the difference. Others are different assembly code, different gcc versions, and different gcc options.

You should look inside the code, find the hot spot, and analyze how many pixels are processed per instruction and how the processing order could be made friendlier to the CPU and memory. Rewriting the hotspot (the part of the code where most of the running time is spent) is the key to solving the task http://shtech.org/course/ca/projects/3/.

You may use the perf profiler in record / report / annotate mode to find the hot spot (it will be easier if you recompile the project with the -g option added):

# Profile program using cpu cycle performance counter; write profile to perf.data file
perf record ./test test_arg1 test_arg2
# Read perf.data file and report functions where time was spent 
#  (Do not change ./test file, or recompile it after record and before report)
perf report
# Find the hotspot in the top functions by annotation
#  you may use Arrows and Enter to do "annotate" action from report; or:
perf annotate -s top_function_name
perf annotate -s top_function_name > annotate_func1.txt

I was able to speed up the program by a factor of 7 for a small bin file and 277 10 arguments on my mobile i5-4* (Intel Haswell) with 2 cores (4 virtual cores with HT enabled) and AVX2+FMA.

Rewriting some loops / loop nests is needed. You should understand how the CPU cache works and access memory in a way that misses it as rarely as possible. Also, gcc is not always smart enough to detect the data-access pattern; such detection may be needed to process several pixels in parallel.

osgx
  • Really great, thanks. I didn't know any tools to analyse program performance before. What do you mean by finding the hot spot? I am puzzled about how to fit the data in the cache. Also, I used SSE to improve performance; it improved the time by 30%. – makeapp Jun 27 '16 at 11:56
  • it should be 'perf annotate -s top_function_name' – makeapp Jun 27 '16 at 15:14
  • makeapp, you can't just "use" sse (which one? there are SSE, SSE2, SSE3, AVX, AVX2, FMA, AVX-512; some are wider SIMD; check wiki https://en.wikipedia.org/wiki/X86_instruction_listings#SIMD_instructions), you should see what the code is and how it accesses data, whether it is a high-performance access type or not. Then you should see the assembler (what the compiler did). In the x86_64 world there is ONLY SSE2 to work with floats/doubles in hardware; but even SSE2 may be used for scalar operations ("scalar", ss suffix) or for vectorized operations ("packed", ps suffix). It is your task to optimize the program, not mine. – osgx Jun 27 '16 at 15:21
  • @makeapp, check this thread http://stackoverflow.com/questions/9936132/why-does-the-order-of-the-loops-affect-performance-when-iterating-over-a-2d-arra Why does the order of the loops affect performance when iterating over a 2D array? – osgx Jul 22 '16 at 07:42