1

I believe code like this is common in C++:

for(size_t i=0;i<ARRAY_SIZE;++i)
    A[i]=B[i]*C[i];

One commonly advocated alternative is:

double* pA=A,pB=B,pC=C;
for(size_t i=0;i<ARRAY_SIZE;++i)
    *pA++=(*pB++)*(*pC++);

What I am wondering is: what is the best way to improve this code? IMO, the following things need to be considered:

  1. CPU cache. How do CPUs fill their caches to get the best hit rate?
  2. I suppose SSE could improve this?
  3. The other thing is, what if the code could be parallelized, e.g. using OpenMP? In that case, the pointer trick may not be available.

Any suggestions would be appreciated!

xis
  • 24,330
  • 9
  • 43
  • 59
  • IMO, your compiler should do a fairly good job at optimizing your code for cached access at least. – zneak Jun 08 '11 at 05:03
  • In the second loop why you still using `i` when you have pointers ? `i` is unused there. – iammilind Jun 08 '11 at 05:04
  • @iammilind i is used to count the number of array. it is not a string array anyway. – xis Jun 08 '11 at 05:07
  • @zneak Well, it is better to assume compilers does not optimize too much IMHO. – xis Jun 08 '11 at 05:08
  • 7
    @xis19 - It is better to assume that you cannot beat the compiler with some simple tricks. The compiler writers know them all, and do this for a living. – Bo Persson Jun 08 '11 at 05:13
  • BTW, this particular transformation is called *operator strength reduction* (it replaces a multiply or shift plus two adds with just an add) and as everyone else has said, it's a very well-known optimization. – Ryan Culpepper Jun 08 '11 at 06:06

6 Answers

5

My g++ 4.5.2 produces absolutely identical code for both loops (after fixing the error in the declaration: `double *pA=A, *pB=B, *pC=C;`), and it is:

.L3:
    movapd  B(%rax), %xmm0
    mulpd   C(%rax), %xmm0
    movapd  %xmm0, A(%rax)
    addq    $16, %rax
    cmpq    $80000, %rax
    jne .L3

(where my ARRAY_SIZE was 10000)

The compiler authors know these tricks already. OpenMP and other concurrent solutions are worth investigating, though.

Cubbi
  • 46,567
  • 13
  • 103
  • 169
4

The rules for performance are:

  1. don't optimize yet

  2. get a target

  3. measure

  4. get an idea of how much improvement is possible and verify it is worthwhile to spend time to get it.

This is even more true for modern processors. About your questions:

  1. simple index-to-pointer mapping is often done by compilers, and when they don't do it, they may have good reasons.

  2. processors are often already optimized for sequential access to the cache: straightforward code generation will often give the best performance.

  3. SSE can perhaps improve this. But not if you are already bandwidth-limited, in which case we are back to the measure-and-determine-bounds stage.

  4. parallelization: same as SSE. Using the multiple cores of a single processor won't help if you are bandwidth-limited. Using different processors may help, depending on the memory architecture.

  5. manual loop unrolling (suggested in a now-deleted answer) is often a bad idea. Compilers know how to do this when it is worthwhile (for instance, when they can do software pipelining), and with modern OOO processors it is often not the case (it increases the pressure on the instruction and trace caches, while OOO execution, speculation over jumps, and register renaming automatically bring most of the benefit of unrolling and software pipelining).

AProgrammer
  • 51,233
  • 8
  • 91
  • 143
2

The first form is exactly the sort of structure that your compiler will recognize and optimize, almost certainly emitting SSE instructions automatically.

For this kind of trivial inner loop, cache effects are irrelevant, because you are iterating through everything. If you have nested loops, or a sequence of operations (like g(f(A,B),C)), then you might try to arrange to access small blocks of memory repeatedly to be more cache-friendly.

Do not unroll the loop by hand. Your compiler will already do that, too, if it is a good idea (which it may not be on a modern CPU).

OpenMP may help if the loop is huge and the operations within it are complicated enough that you are not already memory-bound.

In general, write your code in a natural and straightforward way, because that is what your optimizing compiler is most likely to understand.

Nemo
  • 70,042
  • 10
  • 116
  • 153
  • What if I think the pointer version *is* the most natural and straightforward? – Billy ONeal Jun 08 '11 at 06:03
  • @Billy: Then you have never programmed in any language other than C/C++? Seriously, twiddling pointers is much more likely to confuse a compiler (and a human reader, too) than simple array references. – Nemo Jun 08 '11 at 06:04
  • I've programmed in lots of languages. (That doesn't mean I don't like C++ more) I often use pointer iteration because I am used to writing STL generic code in C++. But that's not the point of my comment here -- my point is that the big italicized part of your answer should be "write your code in a natural and straightforward way", rather than "Do not use the pointer version". I agree in this case the indexing is probably clearer, but it isn't always. I've seen too much nasty C and C++ caused by those who didn't grok pointers trying to work around it using nasty indexing math and such. – Billy ONeal Jun 08 '11 at 06:08
  • @Billy: Point taken. I have deleted my first sentence. – Nemo Jun 08 '11 at 06:16
1

When to start considering SSE or OpenMP? If both of these are true:

  • If code similar to yours appears 20 times or more in your project:
    for (size_t i = 0; i < ARRAY_SIZE; ++i)
        A[i] = B[i] * C[i];
    or some similar operation
  • If ARRAY_SIZE is routinely bigger than 10 million, or a profiler tells you that this operation is becoming a bottleneck

Then,

  • First, make it into a function:
    void array_mul(double* pa, const double* pb, const double* pc, size_t count)
    { for (...) }
  • Second, if you can afford to find a suitable SIMD library, change your function to use it.

As a side note, if you have a lot of operations that are only slightly more complicated than this, e.g. A[i] = B[i] * C[i] + D[i], then a library which supports expression templates will be useful too.

Community
  • 1
  • 1
rwong
  • 6,062
  • 1
  • 23
  • 51
0

You can use some easy parallelization methods. CUDA is hardware-dependent, but SSE is available in almost every CPU. You can also use multiple threads; with multiple threads you can still use the pointer trick, though it is not very important, since such simple optimizations can be done by the compiler as well. If you are using Visual Studio 2010, you can use parallel_invoke to execute functions in parallel without dealing with Windows threads. On Linux, the pthread library is quite easy to use.

Cem Kalyoncu
  • 14,120
  • 4
  • 40
  • 62
0

I think valarray is specialised for such calculations, though I am not sure whether it will improve performance.

balki
  • 26,394
  • 30
  • 105
  • 151