
I want to copy data from one C array to another, but with a calculation in between (i.e. not just copying the same content from one array to the other, but modifying the data along the way):

int aaa;
int src[ARRAY_SIZE];
int dest[ARRAY_SIZE];

//fill src with data

for (aaa = 0; aaa < ARRAY_SIZE; aaa++)
{
    dest[aaa] = src[aaa] * 30;
}

This is done on buffers of size 520 or larger, so the for loop runs a considerable number of iterations.

Is there any way to improve performance here in terms of coding?

I did some research on the topic, but I couldn't find anything specific to this case, only about simply copying one buffer to another (examples: here, here and here).

Environment: GCC for ARM using Embedded Linux. The specific code above, though, is used inside a C project running on a dedicated processor for DSP calculations. The general processor is an OMAP L138 (which includes the DSP core).

Momergil
  • Take a look at Duff's Device: http://en.wikipedia.org/wiki/Duff%27s_device – Eugene Sh. Dec 18 '14 at 19:19
  • OpenMP/threading could speed it up by factors. – Carlise Dec 18 '14 at 19:21
  • @Carlise are you sure that multiple threads would not create cache issues when using multiple cores? in the case of a single one you would pay the context switch between different threads. – DRC Dec 18 '14 at 19:26
  • For such a simple case, any half-decent compiler should optimize it for you. It would unroll the loop and use SIMD. Beyond that is core-level parallelization. For example the mentioned OpenMP. – luk32 Dec 18 '14 at 19:26
  • @EugeneSh. This is a straightforward copy and multiply loop. I would think that Duff's Device would only confuse the compiler, and slow down the actual execution. – Degustaf Dec 18 '14 at 19:29
  • If ARRAY_SIZE is fixed, you could try to manually unroll the loop. And I assume you have already turned on optimization in the compiler. – Thomas Padron-McCarthy Dec 18 '14 at 19:30
  • @Degustaf Maybe. The latest GCC versions are doing some funny stuff.. But it can be tried. – Eugene Sh. Dec 18 '14 at 19:30
  • @DRC I'm not sure what you mean? – Carlise Dec 18 '14 at 19:33
  • There are plenty of possible optimizations, with varying degrees of portability (eg. SIMD may be an option) and complexity (though using multiple cores is unlikely to pay off for 520 ints). Some, like simple unrolling, the compiler may do for you. However, a survey of all possible optimizations on all platform/compiler combinations is much too broad. If you have an actual performance problem, perhaps you could describe your constraints, platform and compiler. – Useless Dec 18 '14 at 19:47

1 Answer


You could try techniques such as loop unrolling or Duff's device, but if you switch on compiler optimisation it will probably do that for you anyway, where advantageous, without making your code unreadable.
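For illustration, a manually unrolled version might look like the sketch below (the function name is illustrative; it assumes ARRAY_SIZE is a multiple of four, which 520 is — otherwise a remainder loop would be needed):

```c
#define ARRAY_SIZE 520

int src[ARRAY_SIZE];
int dest[ARRAY_SIZE];

/* Process four elements per iteration to reduce loop-control
   overhead. Assumes ARRAY_SIZE is a multiple of 4. */
void scale_unrolled(void)
{
    int aaa;
    for (aaa = 0; aaa < ARRAY_SIZE; aaa += 4)
    {
        dest[aaa]     = src[aaa]     * 30;
        dest[aaa + 1] = src[aaa + 1] * 30;
        dest[aaa + 2] = src[aaa + 2] * 30;
        dest[aaa + 3] = src[aaa + 3] * 30;
    }
}
```

Note that with `-O2` or `-O3`, GCC will typically unroll such a loop itself, so measure before keeping the hand-unrolled form.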

The advantage of relying on compiler optimisation is that it is architecture specific: a source-level technique that works well on one target may not work so well on another, whereas compiler-generated optimisations are tailored to the target. For example, there is no way to code specifically for SIMD instructions in standard C, but the compiler may generate code that takes advantage of them; to help it do that, it is best to keep the code simple and direct so that the compiler can spot the idiom. Writing weird code to "hand optimise" can defeat the optimizer and stop it doing its job.
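One simple, portable way to help the optimiser along these lines is C99's `restrict` qualifier, which promises the compiler that the two buffers do not overlap — a common obstacle to auto-vectorisation. A sketch (the function name is illustrative):

```c
#include <stddef.h>

/* restrict tells the compiler that dest and src do not alias,
   so it is free to load and store several elements at a time */
void scale_by_30(int *restrict dest, const int *restrict src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
    {
        dest[i] = src[i] * 30;
    }
}
```

Compiled with optimisation enabled (e.g. `-O2` with `-ftree-vectorize`, or `-O3`), GCC can often vectorise this form where the target supports it.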

Another possibility that may be advantageous on some targets (if you only ever code for desktop x86 targets, this may be irrelevant) is to avoid the multiply instruction by using shifts:

Given that x * 30 is equivalent to x * 32 - x * 2, the expression in the loop can be replaced with:

dest[aaa] = (src[aaa] << 5) - (src[aaa] << 1) ;

But again the optimizer may well do that for you; it will also avoid the repeated evaluation of src[aaa], but if that were not the case, the following may be beneficial:

int i = src[aaa] ;
dest[aaa] = (i << 5) - (i << 1) ;

The shifting technique is likely to be more advantageous for division operations which are far more expensive on most targets, and it is only applicable to constants.
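For example, for unsigned (or known non-negative) values, division by a power of two is exactly a right shift — a sketch (note that signed division of negative values rounds toward zero while shifting rounds toward negative infinity, so the shift is not a drop-in replacement there):

```c
/* For unsigned values, x / 8 is equivalent to x >> 3 */
unsigned int div_by_8(unsigned int x)
{
    return x >> 3;
}
```

Again, modern compilers already perform this transformation for division by constants, so this mainly matters for unoptimized builds.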

These techniques are likely to improve the performance of unoptimized code, but compiler optimisations will likely do far better, and the original code may optimise better than "hand optimised" code.

In the end, if it is important, you have to experiment and perform timing tests or profiling.

Clifford