Following up on my previous question: my idea was to optimize an algorithm by skipping calculations when the coefficients m_a and m_b are 1.0 or 0.0. I tried to optimize the algorithm accordingly and got some curious results which I can't explain.

First analyzer run for 100k samples. Parameter values are read from file (!):

b0=1.0 b1=-1.480838022915731 b2=1.0

a0=1.0 a1=-1.784147570544337 a2=0.854309980957510

Slow

Second analyzer run same 100k samples. Parameter values are read from file (!):

b0=1.0 b1=-1.480838022915731 b2=1.0

a0=1.0 a1=-1.784147570544337 a2=0.0 <--- Only a2 is different !

Fast

In the figures, the numbers on the left side (grey background) are the CPU cycles needed. As is clearly visible, the second run with parameter a2=0.0 is a lot faster.

I checked the difference between the debug and release builds. The release build is faster (as expected), but both debug and release show the same strange behaviour when parameter a2 is modified.

Then I checked the ASM code and noticed that SSE instructions are used. That is expected, because I compiled with /arch:SSE2. So I disabled SSE; the resulting code no longer uses SSE instructions, and the performance no longer depends on the value of a2 (as expected).

Therefore I came to the conclusion that there is some kind of performance benefit when SSE is used: the SSE engine detects that a2 is 0.0 and omits the now-pointless multiplication and subtraction. I had never heard of this and tried to find information about it, but without success.

So does anyone have an explanation for these performance results?
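For context, the filter is a standard direct-form I biquad. Here is a minimal sketch of the inner loop (my own reconstruction; the struct and variable names are not from the original code):

```cpp
// Direct-form I biquad:
//   y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
// (a0 is assumed to be 1.0, as in the parameter sets above)
struct Biquad {
    float b0, b1, b2, a1, a2;              // coefficients
    float x1 = 0, x2 = 0, y1 = 0, y2 = 0;  // delay-line state

    float process(float x) {
        float y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2;
        x2 = x1; x1 = x;
        // Feedback state: as the output decays toward zero, y1/y2 can
        // drift into the subnormal range.
        y2 = y1; y1 = y;
        return y;
    }
};
```

Note that with a2 = 0.0 the last product is still executed per sample; the question is whether the hardware can skip it.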

For completeness this is the relevant ASM code for the release version:

00F43EC0  mov         edx,dword ptr [ebx]  
00F43EC2  movss       xmm0,dword ptr [eax+edi*4]  
00F43EC7  cmp         edx,dword ptr [ebx+4]  
00F43ECA  je          $LN419+193h (0F43F9Dh)  
00F43ED0  mov         esi,dword ptr [ebx+4]  
00F43ED3  lea         eax,[edx+68h]  
00F43ED6  lea         ecx,[eax-68h]  
00F43ED9  cvtps2pd    xmm0,xmm0  
00F43EDC  cmp         ecx,esi  
00F43EDE  je          $LN419+180h (0F43F8Ah)  
00F43EE4  movss       xmm1,dword ptr [eax+4]  
00F43EE9  mov         ecx,dword ptr [eax]  
00F43EEB  mov         edx,dword ptr [eax-24h]  
00F43EEE  movss       xmm3,dword ptr [edx+4]  
00F43EF3  cvtps2pd    xmm1,xmm1  
00F43EF6  mulsd       xmm1,xmm0  
00F43EFA  movss       xmm0,dword ptr [ecx]  
00F43EFE  cvtps2pd    xmm4,xmm0  
00F43F01  cvtps2pd    xmm3,xmm3  
00F43F04  mulsd       xmm3,xmm4  
00F43F08  xorps       xmm2,xmm2  
00F43F0B  cvtpd2ps    xmm2,xmm1  
00F43F0F  movss       xmm1,dword ptr [ecx+4]  
00F43F14  cvtps2pd    xmm4,xmm1  
00F43F17  cvtps2pd    xmm2,xmm2  
00F43F1A  subsd       xmm2,xmm3  
00F43F1E  movss       xmm3,dword ptr [edx+8]  
00F43F23  mov         edx,dword ptr [eax-48h]  
00F43F26  cvtps2pd    xmm3,xmm3  
00F43F29  mulsd       xmm3,xmm4  
00F43F2D  subsd       xmm2,xmm3  
00F43F31  movss       xmm3,dword ptr [edx+4]  
00F43F36  cvtps2pd    xmm4,xmm0  
00F43F39  cvtps2pd    xmm3,xmm3  
00F43F3C  mulsd       xmm3,xmm4  
00F43F40  movss       xmm4,dword ptr [edx]  
00F43F44  cvtps2pd    xmm4,xmm4  
00F43F47  cvtpd2ps    xmm2,xmm2  
00F43F4B  xorps       xmm5,xmm5  
00F43F4E  cvtss2sd    xmm5,xmm2  
00F43F52  mulsd       xmm4,xmm5  
00F43F56  addsd       xmm3,xmm4  
00F43F5A  movss       xmm4,dword ptr [edx+8]  
00F43F5F  cvtps2pd    xmm1,xmm1  
00F43F62  movss       dword ptr [ecx+4],xmm0  
00F43F67  mov         edx,dword ptr [eax]  
00F43F69  cvtps2pd    xmm4,xmm4  
00F43F6C  mulsd       xmm4,xmm1  
00F43F70  addsd       xmm3,xmm4  
00F43F74  xorps       xmm1,xmm1  
00F43F77  cvtpd2ps    xmm1,xmm3  
00F43F7B  movss       dword ptr [edx],xmm2  
00F43F7F  movaps      xmm0,xmm1  
00F43F82  add         eax,70h  
00F43F85  jmp         $LN419+0CCh (0F43ED6h)  
00F43F8A  movss       xmm1,dword ptr [ebx+10h]  
00F43F8F  cvtps2pd    xmm1,xmm1  
00F43F92  mulsd       xmm1,xmm0  
00F43F96  xorps       xmm0,xmm0  
00F43F99  cvtpd2ps    xmm0,xmm1  
00F43F9D  mov         eax,dword ptr [ebp-4Ch]  
00F43FA0  movss       dword ptr [eax+edi*4],xmm0  
00F43FA5  mov         ecx,dword ptr [ebp-38h]  
00F43FA8  mov         eax,dword ptr [ebp-3Ch]  
00F43FAB  sub         ecx,eax  
00F43FAD  inc         edi  
00F43FAE  sar         ecx,2  
00F43FB1  cmp         edi,ecx  
00F43FB3  jb          $LN419+0B6h (0F43EC0h)  

Edit: Replaced the debug ASM code with the release code.

Maik
    "corresponding ASM code for the debug version" Why is it relevant? Measuring performance of unoptimised code is pointless. – n. m. could be an AI Nov 08 '14 at 11:57
  • As already stated: the release version is faster, but the debug and release versions behave the same. I attached the debug ASM code because it is easier to read. If you prefer the release ASM code I can attach it as well. – Maik Nov 08 '14 at 12:55
  • Instead of posting assembly listings, post compilable code, we are perfectly able to run the compiler ourselves. – n. m. could be an AI Nov 08 '14 at 13:01
  • Debug version assembly is utterly irrelevant in every fashion. – Puppy Nov 08 '14 at 13:48
  • I replaced the debug ASM code with the release code. What benefit does that give you in relation to my question? – Maik Nov 08 '14 at 14:13
  • Very closely related: http://stackoverflow.com/questions/9314534/why-does-changing-0-1f-to-0-slow-down-performance-by-10x – Mysticial Nov 08 '14 at 19:05

2 Answers


There are no early outs for FP multiplication on SSE. It's a fully pipelined operation with short latency, so adding early outs would complicate instruction retirement while providing zero performance benefit. The only instructions that commonly have data-dependent execution characteristics on modern processors are divide and square root (ignoring subnormals, which affect a wider array of instructions). This is extensively documented by both Intel and AMD, and also independently by Agner Fog.

So why do you see a change in performance? The most likely explanation is that you are encountering stalls due to subnormal inputs or results; this is very common with DSP filters and delays, like the one that you have. Without seeing your code and the input data, it's impossible to be sure that this is what's happening, but it's by far the most likely explanation. If so, you can fix the problem by setting the DAZ and FTZ bits in MXCSR.
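A minimal sketch of setting those bits via the standard SSE intrinsics (the `xmmintrin.h`/`pmmintrin.h` route; `_MM_SET_DENORMALS_ZERO_MODE` needs SSE3-era headers, so check your compiler's `/arch` settings):

```cpp
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE

// FTZ: subnormal *results* are flushed to zero.
// DAZ: subnormal *inputs* are treated as zero.
// Together they eliminate the microcode-assist stalls that subnormals
// cause, at the cost of strict IEEE 754 conformance.
void enable_ftz_daz() {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}
```

MXCSR is per-thread state, so this needs to be called on every thread that runs the filter.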

Intel documentation: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf (consult latency tables in the appendix, note that there's a fixed value for mulss and mulsd.)

AMD 16h instruction latencies (excel spreadsheet): http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/AMD64_16h_InstrLatency_1.1.xlsx

Agner Fog's instruction latency tables for both Intel and AMD: http://www.agner.org/optimize/instruction_tables.pdf

Stephen Canon

This would be normal behavior if the FP HW multiplication unit performed early-out operations; you can see more details here. It would mean that when the hardware detects the 0.0 value, it does not pass it through the whole pipeline.

However, you are using the SSE mulsd instruction. In his post, Stephen Canon pointed out that in both the Intel and AMD implementations the latency of mulsd is fixed. This indicates that there is no early-out functionality in SSE.

Stephen Canon also pointed out that performance issues occur when using denormal numbers. In this post you can read more about what causes that.

However, all your coefficients are ordinary values that don't appear to be denormal, so the problem could be somewhere else. All the ASM instructions in your code are documented to have fixed latencies, yet the huge difference in cycles indicates that something else is going on.

Your profiling output shows that all your latencies changed, even though the 0.0 coefficient appears in only a few multiplications. Is the result computed correctly? Are all the other variables constant between runs?
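As a side note, even when every coefficient is an ordinary value, a decaying feedback signal reaches the subnormal range after only a few hundred samples. A small check (my own illustration, not the OP's code):

```cpp
#include <cfloat>   // FLT_MIN
#include <cmath>    // std::fpclassify, FP_SUBNORMAL

// Attenuate a unit impulse step by step, as a filter's feedback path
// would, and count the steps until the value drops below FLT_MIN and
// becomes subnormal.
int steps_until_subnormal(float gain) {
    float y = 1.0f;
    int n = 0;
    while (std::fpclassify(y) != FP_SUBNORMAL) {
        y *= gain;
        ++n;
    }
    return n;
}
```

With a gain around 0.854 (comparable to the a2 above) this takes only about 550 steps, i.e. well within a 100k-sample run.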

VAndrei
  • @VAndrei: Thank you for this useful answer. I am a bit surprised that I didn't find this important information myself. I had previously checked this page, http://x86.renejeschke.de/html/file_module_x86_id_213.html, where latency and throughput are listed as constant values. Note: I updated the title as proposed. – Maik Nov 08 '14 at 13:39
  • SSE **does not** have early outs for FP multiplication. This is well-documented by Intel, AMD, and Agner Fog, which are much more reliable sources than Gamasutra. – Stephen Canon Nov 08 '14 at 17:18
  • @StephenCanon I revised my post, upvoted yours. Thank you very much for providing the links. The OP probably will mark your answer as correct. However I'm concerned about the fact that the values don't seem denormal. – VAndrei Nov 08 '14 at 18:23
  • This sort of delay can easily cause denormal values to result even when the initial data contains only well-scaled data. This is very common, and the reason that much audio processing is done with FTZ. – Stephen Canon Nov 08 '14 at 18:43
  • Just for completeness: the performance issue I detected IS the result of denormal values. If I modify the input signal and ensure that it does not come near 0.0, the performance is as expected. I was not aware that denormal values have such a high performance impact! – Maik Nov 11 '14 at 18:14
  • @Maik thanks for pointing that out. Denormals are a performance problem. So does that mean that you tested the two cases with two different signals, the first one having some denormal values? – VAndrei Nov 11 '14 at 20:54
  • @VAndrei: the denormal values aren't in his input data, they arise as a result of decay in the feedback loop. – Stephen Canon Nov 12 '14 at 13:50
  • I can avoid the denormal values (which, as Stephen already pointed out, are a result of the feedback loop) by modifying the input data (DC offset). Maybe this is not a practical solution, but now I know the root cause of the performance issue :) – Maik Nov 12 '14 at 17:45