Following up on my previous question: my idea was to optimize an algorithm by skipping calculations when the coefficients m_a and m_b are 1.0 or 0.0. I have now tried to optimize the algorithm and got some curious results which I can't explain.
First analyzer run for 100k samples. Parameter values are read from file (!):
b0=1.0 b1=-1.480838022915731 b2=1.0
a0=1.0 a1=-1.784147570544337 a2=0.854309980957510
Second analyzer run with the same 100k samples. Parameter values are read from file (!):
b0=1.0 b1=-1.480838022915731 b2=1.0
a0=1.0 a1=-1.784147570544337 a2=0.0 <--- Only a2 is different !
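For readers without the original source: the coefficients above describe a second-order IIR section, and the release disassembly further down looks like a direct form II update that stores its state as float and does the arithmetic in double. The following is only a minimal sketch under that assumption. The names BiquadCoeffs, BiquadState, biquad_step and biquad_process are hypothetical and do not come from the original code, and a0 is assumed to be normalized to 1.0.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical types and names; the original source is not shown in the question.
struct BiquadCoeffs {
    float b0, b1, b2;  // feed-forward coefficients
    float a1, a2;      // feedback coefficients (a0 assumed to be 1.0)
};

struct BiquadState {
    float w1 = 0.0f;   // delayed intermediate value w[n-1]
    float w2 = 0.0f;   // delayed intermediate value w[n-2]
};

// One direct form II step:
//   w[n] = x[n] - a1*w[n-1] - a2*w[n-2]
//   y[n] = b0*w[n] + b1*w[n-1] + b2*w[n-2]
// Arithmetic is done in double and rounded back to float, which is what the
// cvtps2pd / mulsd / cvtpd2ps sequences in the disassembly suggest.
inline float biquad_step(const BiquadCoeffs& c, BiquadState& s, float x) {
    const float w = static_cast<float>(
        double(x) - double(c.a1) * s.w1 - double(c.a2) * s.w2);
    const float y = static_cast<float>(
        double(c.b0) * w + double(c.b1) * s.w1 + double(c.b2) * s.w2);
    s.w2 = s.w1;  // shift the delay line
    s.w1 = w;
    return y;
}

// Filters a buffer in place, one sample at a time.
inline void biquad_process(const BiquadCoeffs& c, BiquadState& s,
                           std::vector<float>& samples) {
    for (float& x : samples) {
        x = biquad_step(c, s, x);
    }
}
```

In this form, a2 only enters through the single product a2*w[n-2] in the feedback path, which is the multiplication and subtraction the question is about.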
In the figures, the numbers on the left side (grey background) show the CPU cycles needed. As is clearly visible, the second run with a2=0.0 is a lot faster.
I checked the difference between debug and release code. Release code is faster (as expected). Debug and release code have the same strange behaviour when parameter a2 is modified.
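To make the comparison reproducible outside the analyzer, a small timing harness along the following lines could be used. This is a sketch only: time_run and RunResult are hypothetical names, the sine test tone is an assumed input (the real 100k samples are not shown), wall-clock time from std::chrono stands in for the analyzer's cycle counts, and the filter loop is an assumed direct form II update.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

struct RunResult {
    double seconds;  // wall-clock time spent in the filtering loop
    float  last;     // last output sample, returned so the loop is not optimized away
};

// Hypothetical helper: filters n samples with the given feedback coefficients.
// b0, b1, b2 are the values listed in the question.
RunResult time_run(float a1, float a2, std::size_t n) {
    const float b0 = 1.0f, b1 = -1.480838022915731f, b2 = 1.0f;

    // Assumed input signal; replace with the real sample data.
    std::vector<float> samples(n);
    for (std::size_t i = 0; i < n; ++i) {
        samples[i] = static_cast<float>(std::sin(0.01 * static_cast<double>(i)));
    }

    float w1 = 0.0f, w2 = 0.0f;  // filter state
    const auto start = std::chrono::steady_clock::now();
    for (float& x : samples) {
        const float w = static_cast<float>(
            double(x) - double(a1) * w1 - double(a2) * w2);
        x = static_cast<float>(
            double(b0) * w + double(b1) * w1 + double(b2) * w2);
        w2 = w1;
        w1 = w;
    }
    const auto stop = std::chrono::steady_clock::now();

    return { std::chrono::duration<double>(stop - start).count(),
             samples.empty() ? 0.0f : samples.back() };
}
```

Calling time_run once per coefficient set, with a1 and a2 obtained at run time (for example parsed from a file, as in the question) so the compiler cannot fold them into constants, and building it both with and without /arch:SSE2, should show whether the difference is tied to the coefficient value or to the measurement setup.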
Then I checked the ASM code and noticed that SSE instructions are used. This is expected because I compiled with /arch:SSE2, so I disabled SSE. The resulting code doesn't use SSE anymore, and the performance does NOT depend on the value of a2 anymore (as expected).
I therefore came to the conclusion that there is some kind of performance benefit when SSE is used: the SSE engine seems to detect that a2 is 0.0 and therefore omits the unnecessary multiplication and subtraction. I have never heard of this and tried to find information about it, but without success.
Does anyone have an explanation for my performance results?
For completeness this is the relevant ASM code for the release version:
00F43EC0 mov edx,dword ptr [ebx]
00F43EC2 movss xmm0,dword ptr [eax+edi*4]
00F43EC7 cmp edx,dword ptr [ebx+4]
00F43ECA je $LN419+193h (0F43F9Dh)
00F43ED0 mov esi,dword ptr [ebx+4]
00F43ED3 lea eax,[edx+68h]
00F43ED6 lea ecx,[eax-68h]
00F43ED9 cvtps2pd xmm0,xmm0
00F43EDC cmp ecx,esi
00F43EDE je $LN419+180h (0F43F8Ah)
00F43EE4 movss xmm1,dword ptr [eax+4]
00F43EE9 mov ecx,dword ptr [eax]
00F43EEB mov edx,dword ptr [eax-24h]
00F43EEE movss xmm3,dword ptr [edx+4]
00F43EF3 cvtps2pd xmm1,xmm1
00F43EF6 mulsd xmm1,xmm0
00F43EFA movss xmm0,dword ptr [ecx]
00F43EFE cvtps2pd xmm4,xmm0
00F43F01 cvtps2pd xmm3,xmm3
00F43F04 mulsd xmm3,xmm4
00F43F08 xorps xmm2,xmm2
00F43F0B cvtpd2ps xmm2,xmm1
00F43F0F movss xmm1,dword ptr [ecx+4]
00F43F14 cvtps2pd xmm4,xmm1
00F43F17 cvtps2pd xmm2,xmm2
00F43F1A subsd xmm2,xmm3
00F43F1E movss xmm3,dword ptr [edx+8]
00F43F23 mov edx,dword ptr [eax-48h]
00F43F26 cvtps2pd xmm3,xmm3
00F43F29 mulsd xmm3,xmm4
00F43F2D subsd xmm2,xmm3
00F43F31 movss xmm3,dword ptr [edx+4]
00F43F36 cvtps2pd xmm4,xmm0
00F43F39 cvtps2pd xmm3,xmm3
00F43F3C mulsd xmm3,xmm4
00F43F40 movss xmm4,dword ptr [edx]
00F43F44 cvtps2pd xmm4,xmm4
00F43F47 cvtpd2ps xmm2,xmm2
00F43F4B xorps xmm5,xmm5
00F43F4E cvtss2sd xmm5,xmm2
00F43F52 mulsd xmm4,xmm5
00F43F56 addsd xmm3,xmm4
00F43F5A movss xmm4,dword ptr [edx+8]
00F43F5F cvtps2pd xmm1,xmm1
00F43F62 movss dword ptr [ecx+4],xmm0
00F43F67 mov edx,dword ptr [eax]
00F43F69 cvtps2pd xmm4,xmm4
00F43F6C mulsd xmm4,xmm1
00F43F70 addsd xmm3,xmm4
00F43F74 xorps xmm1,xmm1
00F43F77 cvtpd2ps xmm1,xmm3
00F43F7B movss dword ptr [edx],xmm2
00F43F7F movaps xmm0,xmm1
00F43F82 add eax,70h
00F43F85 jmp $LN419+0CCh (0F43ED6h)
00F43F8A movss xmm1,dword ptr [ebx+10h]
00F43F8F cvtps2pd xmm1,xmm1
00F43F92 mulsd xmm1,xmm0
00F43F96 xorps xmm0,xmm0
00F43F99 cvtpd2ps xmm0,xmm1
00F43F9D mov eax,dword ptr [ebp-4Ch]
00F43FA0 movss dword ptr [eax+edi*4],xmm0
00F43FA5 mov ecx,dword ptr [ebp-38h]
00F43FA8 mov eax,dword ptr [ebp-3Ch]
00F43FAB sub ecx,eax
00F43FAD inc edi
00F43FAE sar ecx,2
00F43FB1 cmp edi,ecx
00F43FB3 jb $LN419+0B6h (0F43EC0h)
Edit: Replaced the debug ASM code with the release code.