I wrote a basic three-loop matrix multiplication with both a serial and an OpenMP implementation (a sketch of the kernel is below). For the same matrix size (3200x3200), perf stat -a -e instructions,cycles shows:
Serial
265,755,992,060 instructions # 0.71 insn per cycle
375,319,584,656 cycles
85.923380841 seconds time elapsed
Parallel (16 threads)
264,524,937,733 instructions # 0.30 insn per cycle
883,342,095,910 cycles
13.381343295 seconds time elapsed
In the parallel run, I expected the number of cycles to be roughly the same as in the serial run, but it isn't.
Any thoughts on what causes the difference?
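For reference, the kernel is the textbook triple loop. This is only a minimal sketch of what I mean (names and loop order are illustrative, not my exact code); the serial version is the same loop nest without the pragma:

```c
#include <omp.h>

#define N 3200

/* C = A * B with the classic i-j-k triple loop.
 * The serial version is identical except that the pragma is removed
 * (or the file is compiled without -fopenmp). */
static void matmul(const double *A, const double *B, double *C)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}
```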
UPDATE:
I reran the experiments with 8 and 16 threads, since the processor supports at most 16 hardware threads.
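The "Max nthread" line and the two timings below are printed by my driver, which looks roughly like this (again only a sketch reusing the matmul kernel above; taking the thread count from the command line is an illustrative choice, not necessarily how I actually set it):

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Assumes N and matmul() from the kernel sketch above. */
int main(int argc, char **argv)
{
    /* Thread count from the command line (illustrative). */
    int nthreads = (argc > 1) ? atoi(argv[1]) : omp_get_num_procs();
    omp_set_num_threads(nthreads);
    printf("Max nthread = %d\n", omp_get_num_procs());

    double t0 = omp_get_wtime();

    /* Allocate and initialize the matrices (values are arbitrary). */
    double *A = malloc((size_t)N * N * sizeof *A);
    double *B = malloc((size_t)N * N * sizeof *B);
    double *C = malloc((size_t)N * N * sizeof *C);
    for (size_t i = 0; i < (size_t)N * N; i++) { A[i] = 1.0; B[i] = 2.0; }

    double t1 = omp_get_wtime();
    matmul(A, B, C);                 /* the kernel sketched above */
    double t2 = omp_get_wtime();

    printf("Total execution Time in seconds: %.10f\n", t2 - t0);
    printf("MM execution Time in seconds: %.10f\n", t2 - t1);

    free(A); free(B); free(C);
    return 0;
}
```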
Using 8 threads
Max nthread = 16
Total execution Time in seconds: 13.4407235400
MM execution Time in seconds: 13.3349801241
Performance counter stats for 'system wide':
906.51 Joules power/energy-pkg/
264,995,810,457 instructions # 0.59 insn per cycle
449,772,039,792 cycles
13.469242993 seconds time elapsed
and
Using 16 threads
Max nthread = 16
Total execution Time in seconds: 13.2618084711
MM execution Time in seconds: 13.1565077840
Performance counter stats for 'system wide':
1,000.39 Joules power/energy-pkg/
264,309,881,365 instructions # 0.30 insn per cycle
882,881,366,456 cycles
13.289234564 seconds time elapsed
As you can see, the wall-clock times are roughly the same, but the cycle count with 16 threads is about twice that with 8 threads. So with a higher cycle count and a lower IPC, the wall-clock time can stay the same as more threads are used. According to perf list, the event is cpu-cycles OR cycles [Hardware event]. Is that figure the average number of cycles for one core, or the aggregate across all N cores? Any comment on that?
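If it is the aggregate over all cores, the numbers would at least be self-consistent: 882.9e9 cycles / 13.29 s / 16 threads ≈ 4.15 GHz per core, and 449.8e9 cycles / 13.47 s / 8 threads ≈ 4.17 GHz per core, both plausible clock rates, while the serial run gives 375.3e9 cycles / 85.9 s ≈ 4.37 GHz for the single busy core. Is that the right way to read the system-wide cycles counter?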