I have the results of some experiments and need to write about the efficiency of pipelining.
I don't have the source code, but I do have the times it took a 4-layer pipeline and an 8-layer pipeline to sum an array of 100,000,000 doubles.
The sum was performed as follows.
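For the 4-layer pipeline:
// four independent accumulators, one per dependency chain
d0 = 0.0; d1 = 0.0; d2 = 0.0; d3 = 0.0;
// assumes N is a multiple of 4 (true for N = 100,000,000)
for (int i = 0; i < N; i = i + 4) {
    d0 = d0 + a[i + 0];
    d1 = d1 + a[i + 1];
    d2 = d2 + a[i + 2];
    d3 = d3 + a[i + 3];
}
// combine the partial sums
c = d0 + d1 + d2 + d3;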
For the 8-layer pipeline:
// eight independent accumulators, one per dependency chain
d0 = 0.0; d1 = 0.0; d2 = 0.0; d3 = 0.0;
d4 = 0.0; d5 = 0.0; d6 = 0.0; d7 = 0.0;
// assumes N is a multiple of 8 (true for N = 100,000,000)
for (int i = 0; i < N; i = i + 8) {
    d0 = d0 + a[i + 0];
    d1 = d1 + a[i + 1];
    d2 = d2 + a[i + 2];
    d3 = d3 + a[i + 3];
    d4 = d4 + a[i + 4];
    d5 = d5 + a[i + 5];
    d6 = d6 + a[i + 6];
    d7 = d7 + a[i + 7];
}
// combine the partial sums
c = d0 + d1 + d2 + d3 + d4 + d5 + d6 + d7;
The results I have give the times for no pipeline, a 2-layer pipeline, a 4-layer pipeline and an 8-layer pipeline. The code for the no-pipeline and 2-layer versions is similar to the code shown above. The experiment was run on an Intel Core i7-9750H processor. The results, averaged over 10 runs, are as follows.
- No pipeline: 0.106 s
- 2-layer pipeline: 0.064 s
- 4-layer pipeline: 0.046 s
- 8-layer pipeline: 0.048 s
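To put these timings in perspective, here is a minimal sketch that turns them into speedups over the no-pipeline run and into an effective read rate. The function names `speedup`, `read_rate_gb_per_s` and `print_summary` are made up for illustration, and the sketch assumes 8-byte doubles and 1 GB = 1e9 bytes.

```c
#include <stdio.h>

/* Speedup of a variant relative to the baseline: t_base / t_variant. */
double speedup(double t_base, double t_variant) {
    return t_base / t_variant;
}

/* Effective read rate in GB/s (1 GB = 1e9 bytes, an assumption of this sketch)
   when n_elements doubles are summed in the given number of seconds. */
double read_rate_gb_per_s(double n_elements, double seconds) {
    return n_elements * sizeof(double) / seconds / 1e9;
}

/* Prints one line per measured variant, using the averaged times listed above. */
void print_summary(void) {
    const int    accumulators[] = {1, 2, 4, 8};
    const double seconds[]      = {0.106, 0.064, 0.046, 0.048};
    const double n_elements     = 100000000.0;

    for (int k = 0; k < 4; k++) {
        printf("%d accumulator(s): speedup %.2fx, read rate %.1f GB/s\n",
               accumulators[k],
               speedup(seconds[0], seconds[k]),
               read_rate_gb_per_s(n_elements, seconds[k]));
    }
}
```

Calling `print_summary()` from a `main` function gives speedups of roughly 1.66x, 2.30x and 2.21x for the 2-, 4- and 8-layer versions. The 4-layer and 8-layer times differ by only about 4%, so it may be worth checking whether that gap is larger than the run-to-run variation before reading too much into it.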
It is evident that efficiency improves from no pipeline up to the 4-layer pipeline, but I'm trying to work out why it actually got slightly worse from the 4-layer pipeline to the 8-layer pipeline. Since each partial sum accumulates in its own register, there shouldn't be any dependency hazard between them. One idea I had is that there may not be enough ALUs to process more than 4 floating-point additions at once, which would cause stalls, but in that case wouldn't the 8-layer pipeline still perform at least as well as the 4-layer pipeline? I have plotted the processes in Excel to try to find where the stalls/bubbles are happening, but I can't see any.
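Since the original source is not available, the following is only a sketch of how the measurement could be reproduced. `sum_1` and `sum_2` are my guesses at what the no-pipeline and 2-layer versions look like, by analogy with the loops above, and `time_sum` is a hypothetical helper built on the standard `clock()` function. Unlike the loops above, `sum_2` also handles a leftover element when `n` is odd.

```c
#include <stddef.h>
#include <time.h>

/* Assumed "no pipeline" version: a single accumulator, so every
   addition has to wait for the result of the previous one. */
double sum_1(const double *a, size_t n) {
    double d0 = 0.0;
    for (size_t i = 0; i < n; i++) {
        d0 = d0 + a[i];
    }
    return d0;
}

/* Assumed "2-layer" version: two independent accumulators. */
double sum_2(const double *a, size_t n) {
    double d0 = 0.0, d1 = 0.0;
    size_t i = 0;
    for (; i + 2 <= n; i = i + 2) {
        d0 = d0 + a[i + 0];
        d1 = d1 + a[i + 1];
    }
    for (; i < n; i++) {   /* leftover element when n is odd */
        d0 = d0 + a[i];
    }
    return d0 + d1;
}

/* Runs one summation, stores its result in *result and returns the
   CPU time it took in seconds, as measured by clock(). */
double time_sum(double (*sum)(const double *, size_t),
                const double *a, size_t n, double *result) {
    clock_t start = clock();
    *result = sum(a, n);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

A 4- or 8-accumulator function with the same signature can be passed to `time_sum` in the same way. The timings will depend heavily on the compiler and optimization level used to build the code, which I don't know for the original experiment, so absolute numbers from a sketch like this may not match the ones above.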