
I'm trying to optimize a block of instructions in a loop, called thousands of times, which is the bottleneck in my algorithm.

This block of code computes the products of N 3x3 matrices (the iA array) with N 3-component vectors (the iV array) and stores the N results in the oV array. (N is not fixed and is usually between 3000 and 15000.)

Each matrix row and each vector is 128-bit aligned (4 floats) to exploit SSE; the 4th float value is ignored.

C++ code:

  __m128* ip = (__m128*)iV;   // input vectors, one __m128 (4 floats) each
  __m128* op = (__m128*)oV;   // output vectors
  __m128* A  = (__m128*)iA;   // input matrices, three __m128 rows each

  __m128 res1, res2, res3;
  int i;

  for (i=0; i<N; i++)
  {
    // _mm_dp_ps mask: the high nibble selects the elements multiplied
    // (0x7 = x,y,z), the low nibble selects the output slot for the sum.
    res1 = _mm_dp_ps(*A++, *ip, 0x71);    // row 0 . v -> element 0
    res2 = _mm_dp_ps(*A++, *ip, 0x72);    // row 1 . v -> element 1
    res3 = _mm_dp_ps(*A++, *ip++, 0x74);  // row 2 . v -> element 2

    *op++ = _mm_or_ps(res1, _mm_or_ps(res2, res3));  // combine the three results
  }
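
To make the layout concrete, here is a sketch of how the arrays could be allocated (illustrative only; the sizes and names just mirror the description above, and `_mm_malloc` is one way to get the 16-byte alignment):

  #include <smmintrin.h>  // _mm_dp_ps (SSE4.1)

  // Illustrative: each matrix is 3 rows of 4 floats (the 4th float of
  // each row is padding), each vector is 4 floats (4th float ignored).
  float* iA = (float*)_mm_malloc(N * 3 * 4 * sizeof(float), 16);
  float* iV = (float*)_mm_malloc(N * 4 * sizeof(float), 16);
  float* oV = (float*)_mm_malloc(N * 4 * sizeof(float), 16);
  // ... fill iA and iV, run the loop above, then:
  _mm_free(iA); _mm_free(iV); _mm_free(oV);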

The compiler generates these instructions:

000007FEE7DD4FE0  movaps      xmm2,xmmword ptr [rsi]               //load *ip into a register
000007FEE7DD4FE3  movaps      xmm1,xmmword ptr [rdi+10h]           //load second row of A into a register
000007FEE7DD4FE7  movaps      xmm0,xmmword ptr [rdi+20h]           //load third row of A into a register
000007FEE7DD4FEB  inc         r11d                                 //i++
000007FEE7DD4FEE  add         rbp,10h                              //op++
000007FEE7DD4FF2  add         rsi,10h                              //ip++
000007FEE7DD4FF6  dpps        xmm0,xmm2,74h                        //dot product of 3rd row of A with *ip
000007FEE7DD4FFC  dpps        xmm1,xmm2,72h                        //dot product of 2nd row of A with *ip
000007FEE7DD5002  orps        xmm0,xmm1                            //"merge" the results of the two dot products
000007FEE7DD5005  movaps      xmm3,xmmword ptr [rdi]               //load first row of A into a register
000007FEE7DD5008  add         rdi,30h                              //A += 3
000007FEE7DD500C  dpps        xmm3,xmm2,71h                        //dot product of 1st row of A with *ip
000007FEE7DD5012  orps        xmm0,xmm3                            //"merge" in the result
000007FEE7DD5015  movaps      xmmword ptr [rbp-10h],xmm0           //store the result to memory (*op)
000007FEE7DD5019  cmp         r11d,dword ptr [rbx+28h]             //compare i against N
000007FEE7DD501D  jl          MyFunction+370h (7FEE7DD4FE0h)       //loop

I'm not very familiar with low-level optimisations, so the question is: do you see any possible optimisations if I write the assembly code myself?

For example, will it run faster if I replace:

add         rbp,10h
movaps      xmmword ptr [rbp-10h],xmm0

with

movaps      xmmword ptr [rbp],xmm0
add         rbp,10h

I have also read that the ADD instruction is faster than INC...

Michael M.
  • For the particular microoptimizations you asked about (the ordering of `add` / `movaps`, and `add` vs. `inc`), the answer probably depends on the specific CPU type you're coding for, phase of moon, etc. I imagine you wouldn't see a measurable difference even if you tried, since any difference in speed would be overlapped with the time spent in the main calculation. Big-picture wise, I don't see a lot of fat on that code. Maybe an SSE expert can comment on how well the vectorization maps onto Intel's vector instructions. – Joe Z Aug 09 '13 at 14:25
  • *"I have also read that ADD instruction is faster than INC..."* -- really? I don't know crap about assembly, but that seems odd. If it's the case, why would you ever use INC when you could just use ADD with an argument of 1? Why would INC even exist? – Benjamin Lindley Aug 09 '13 at 14:25
  • The `add` vs `inc` thing has to do with the flags, reading the carry flag after an `inc` can give trouble. On P4 it sort of works the other way around, `inc` there has to wait for the flags. In this case, you're immediately overwriting the flags, so this `inc` is not a problem on anything but P4 (and seriously, who cares about P4) – harold Aug 09 '13 at 14:27
  • @BenjaminLindley because it already existed. They can't just remove it... the problems with `inc` did not exist on the 8086; it just executed instructions serially, one at a time. – harold Aug 09 '13 at 14:29
  • @BenjaminLindley regarding the add/inc thing you can refer here http://stackoverflow.com/questions/13383407/is-add-1-really-faster-than-inc-x86 http://stackoverflow.com/questions/25853357/why-does-icc-produce-inc-instead-of-add-in-assembly-on-x86 http://stackoverflow.com/questions/19890157/gcc-doesnt-make-use-of-inc – phuclv Jan 28 '15 at 16:24

3 Answers


Calculating an indirect address with an offset, such as rbp-10h, is very cheap, because there is special hardware for this sort of calculation in the processor's address generation unit (AGU).

There is, however, a dependency between the add rbp,10h and the later store to [rbp-10h], which could in principle cause a problem; but here there is a long distance between the add and its use, so it's not an issue. The compiler probably hoists the add that far up because it's "free" at that point: the processor is waiting for the data read earlier to arrive from memory into the xmm registers. In other words, any work we can place between the loads of xmm0, xmm1 and xmm2 at the beginning of the loop and the dpps instructions that use them is beneficial, because the processor has to wait for that data to arrive before it can compute the dpps results.

Mats Petersson

I've done lots of x86 assembly optimizations, and I can tell you it was a great learning experience. It taught me a lot about how compilers work as well, and the biggest thing I learned was that compilers are in general pretty good at what they do. I know that's a flippant comment, but it is true...

I also learned that optimizations you make can have a positive result on one processor family, and a negative result on another processor family. Things like pipelining, branch prediction, and processor cache play a huge role... so unless you're targeting a very specific hardware configuration, be careful about assumptions regarding improvements you make.

To your specific question about reordering the add to remove the rbp-10h offset... it looks like an obvious improvement, and you'd have to verify by reading the instruction manuals, but I'd guess the -10h memory offset comes for free in that instruction. Moving the add may also throw off a pipelined instruction and actually cost a clock cycle. You'd have to experiment.

mark
  • Yeah, modern x86s are temperamental beasts for fine-grain optimization. Even the byte-alignment of individual instructions can make a difference, depending on the processor. – Joe Z Aug 09 '13 at 14:59

There are a few things you could do to the above code to improve it. Generally, using a value after it has been altered incurs a processor stall as it waits for the result. So these lines would incur a penalty:-

add         rbp,10h
movaps      xmmword ptr [rbp-10h],xmm0

but in the code snippet above those two lines are quite far apart, so that isn't really an issue. As others have already said, the rbp-10h offset is 'free' in that the address calculation hardware handles it.

You could move the movaps xmm3,xmmword ptr [rdi] up a line and maybe rearrange a couple of other lines.

Would it be worth it?

NO

You'd be lucky to see any real performance gain from any of this because your algorithm is

<blink> memory bandwidth limited </blink>

which means that the time taken to read the data from RAM into the CPU is greater than the time it takes the CPU to do its processing. At worst, reading a memory address can involve a page fault and a disk read. The prefetch instructions won't help either; it's called 'Streaming SIMD Extensions' because it's optimised to stream data into the CPU (the memory interface can handle four separate streams, IIRC).

If you were doing a lot of computation on a small set of data (an FFT perhaps) then you could gain a lot from hand-crafting the assembler. But your algorithm is quite simple so the CPU is idling a lot of the time waiting for the data to arrive. If you know the speed of your RAM you could work out the maximum throughput for the algorithm and use that to compare against what it's currently achieving (you'll never reach the maximum theoretical throughput though).
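
For a rough sense of that bound (illustrative numbers only: the padded layout from the question, and an assumed sustained bandwidth of, say, 20 GB/s):

  per iteration: 3*16 bytes (matrix) + 16 bytes (vector) read + 16 bytes written = 80 bytes
  at N = 15000:  80 * 15000 = 1.2 MB moved per call
  at 20 GB/s:    1.2 MB / (20 GB/s) = 60 microseconds, a hard lower bound per call

(If the working set fits in cache and stays hot between calls, the effective bandwidth is higher and the bound shrinks.)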

There are things you can do to minimise the memory stalling, and it's a higher-level change rather than fiddling with individual instructions (often, optimising the algorithm gets better results). The simplest is to double buffer the input data. Divide the register set into two groups (possible to do here as you're only using four of the SIMD registers), as in the pseudocode and the sketch that follow:-

  load set 1
mainloop:
  load set 2
  do processing on set 1
  save set 1 result
  load set 1
  do processing on set 2
  save set 2 result
  goto mainloop
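
For illustration, here is one way that schedule might look with the intrinsics from the question (a sketch only, not tested code; it assumes N is even and keeps the same aligned iA/iV/oV layout):

  __m128* ip = (__m128*)iV;
  __m128* op = (__m128*)oV;
  __m128* A  = (__m128*)iA;

  for (int i = 0; i < N; i += 2)
  {
    // Issue both sets of loads up front so set 2's loads are in
    // flight while set 1's dot products execute.
    __m128 v0 = ip[0];
    __m128 m00 = A[0], m01 = A[1], m02 = A[2];
    __m128 v1 = ip[1];
    __m128 m10 = A[3], m11 = A[4], m12 = A[5];

    // Process and store set 1.
    *op++ = _mm_or_ps(_mm_dp_ps(m00, v0, 0x71),
            _mm_or_ps(_mm_dp_ps(m01, v0, 0x72),
                      _mm_dp_ps(m02, v0, 0x74)));

    // Process and store set 2.
    *op++ = _mm_or_ps(_mm_dp_ps(m10, v1, 0x71),
            _mm_or_ps(_mm_dp_ps(m11, v1, 0x72),
                      _mm_dp_ps(m12, v1, 0x74)));

    ip += 2;
    A  += 6;
  }

In practice an out-of-order core and the compiler's own scheduler may achieve much the same overlap anyway, so measure before and after.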

Hopefully that's given you some ideas. Even if it doesn't go much faster, it's still an interesting exercise and you can learn a lot from it.

  • RIP blink.
Skizz