I use the following two commands to compile my Gaussian blur program:
g++ -Ofast -ffast-math -march=native -flto -fwhole-program -std=c++11 -fopenmp -o interpolateFloatImg interpolateFloatImg.cpp
g++ -O3 -std=c++11 -fopenmp -o interpolateFloatImg interpolateFloatImg.cpp
My two testing environments are:
- i7 4710HQ 4 cores 8 threads
- E5 2650
However, the binary built with the first command runs about 2x as fast on the E5 but only about 0.5x as fast on the i7. The binary built with the second command is faster on the i7 but slower on the E5.
Can anyone explain this?
This is the source code: https://github.com/makeapp007/interpolateFloatImg
I will add more details as soon as possible.
On the i7 the program runs on 8 threads. I don't know how many threads it spawns on the E5.
==== Update ====
I am the teammate of the original author on this project, and here are the results.
Arch-Lenovo-Y50 ~/project/ca/3/12 (git)-[master] % perf stat -d ./interpolateFloatImg lobby.bin out.bin 255 20
Kernel kernelSize : 255
Standard deviation : 20
Kernel maximum: 0.000397887
Kernel minimum: 1.22439e-21
Reading width 20265 height 8533 = 172921245
Micro seconds: 211199093
Performance counter stats for './interpolateFloatImg lobby.bin out.bin 255 20':
1423026.281358 task-clock:u (msec) # 6.516 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
2,604 page-faults:u # 0.002 K/sec
4,167,572,543,807 cycles:u # 2.929 GHz (46.79%)
6,713,517,640,459 instructions:u # 1.61 insn per cycle (59.29%)
725,873,982,404 branches:u # 510.092 M/sec (57.28%)
23,468,237,735 branch-misses:u # 3.23% of all branches (56.99%)
544,480,682,764 L1-dcache-loads:u # 382.622 M/sec (37.00%)
545,000,783,842 L1-dcache-load-misses:u # 100.10% of all L1-dcache hits (31.44%)
38,696,703,292 LLC-loads:u # 27.193 M/sec (26.68%)
1,204,703,652 LLC-load-misses:u # 3.11% of all LL-cache hits (35.70%)
218.384387536 seconds time elapsed
And these are the results from the workstation:
workstation:~/mossCAP3/repos/liuyh1_liujzh/12$ perf stat -d ./interpolateFloatImg ../../../lobby.bin out.bin 255 20
Kernel kernelSize : 255
Standard deviation : 20
Kernel maximum: 0.000397887
Kernel minimum: 1.22439e-21
Reading width 20265 height 8533 = 172921245
Micro seconds: 133661220
Performance counter stats for './interpolateFloatImg ../../../lobby.bin out.bin 255 20':
2035379.528531 task-clock (msec) # 14.485 CPUs utilized
7,370 context-switches # 0.004 K/sec
273 cpu-migrations # 0.000 K/sec
3,123 page-faults # 0.002 K/sec
5,272,393,071,699 cycles # 2.590 GHz [49.99%]
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
7,425,570,600,025 instructions # 1.41 insns per cycle [62.50%]
370,199,835,630 branches # 181.882 M/sec [62.50%]
47,444,417,555 branch-misses # 12.82% of all branches [62.50%]
591,137,049,749 L1-dcache-loads # 290.431 M/sec [62.51%]
545,926,505,523 L1-dcache-load-misses # 92.35% of all L1-dcache hits [62.51%]
38,725,975,976 LLC-loads # 19.026 M/sec [50.00%]
1,093,840,555 LLC-load-misses # 2.82% of all LL-cache hits [49.99%]
140.520016141 seconds time elapsed
==== Update ====
The specification of the E5:
workstation:~$ cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
20 Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
workstation:~$ dmesg | grep cache
[ 0.041489] Dentry cache hash table entries: 4194304 (order: 13, 33554432 bytes)
[ 0.047512] Inode-cache hash table entries: 2097152 (order: 12, 16777216 bytes)
[ 0.050088] Mount-cache hash table entries: 65536 (order: 7, 524288 bytes)
[ 0.050121] Mountpoint-cache hash table entries: 65536 (order: 7, 524288 bytes)
[ 0.558666] PCI: pci_cache_line_size set to 64 bytes
[ 0.918203] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[ 0.948808] xhci_hcd 0000:00:14.0: cache line size of 32 is not supported
[ 1.076303] ehci-pci 0000:00:1a.0: cache line size of 32 is not supported
[ 1.089022] ehci-pci 0000:00:1d.0: cache line size of 32 is not supported
[ 1.549796] sd 4:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 1.552711] sd 5:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 1.552955] sd 6:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA