
I have two huge arrays (int source[1000], dest[1000] in the code below, but with millions of elements in reality). The source array contains a series of ints, of which I want to copy 3 out of every 4.

For example, if the source array is:

int source[1000] = {1,2,3,4,5,6,7,8....};
int dest[1000];

Here is my code:

for (int count_small = 0, count_large = 0; count_large < 1000; count_small += 3, count_large +=4)
    {
      dest[count_small] = source[count_large];
      dest[count_small+1] = source[count_large+1];
      dest[count_small+2] = source[count_large+2];
    }

In the end, dest console output would be:

1 2 3 5 6 7 9 10 11...

But this algorithm is so slow! Is there an algorithm or an open source function that I can use / include?

Thank you :)

Edit: The actual length of my array would be about 1 million (640*480*3)

Edit 2: Processing this for loop takes about 0.98 to 2.28 seconds, while the rest of the code only takes 0.08 to 0.14 seconds, so the device spends at least 90% of its CPU time on the loop alone

Guntram
  • Is this a C or C# question, Guntram? Your tags say C, but I wasn't sure if that was a mistake – Andy Brown Feb 11 '14 at 14:10
  • Given the array definitions, it's definitely not C#. – Sneftel Feb 11 '14 at 14:11
  • Unless the missing `new []` was a typo, hence my question. – Andy Brown Feb 11 '14 at 14:11
  • There are few optimizations the compiler wouldn't already do for you here. So there is no faster way. – Ari Feb 11 '14 at 14:12
  • So slow ??? There are just 1000 elements and your C++ (?) loop does the things in an optimal way. Must be blazingly fast (under a microsecond) ! What is it that you are not telling us ? –  Feb 11 '14 at 14:14
  • Sorry for the misleading tag, it's c++ code – Guntram Feb 11 '14 at 14:16
  • I think this is already as fast as you can get with a loop. So rather than looking for another algorithm, you should look for a new computer :-) – Thorsten Kettner Feb 11 '14 at 14:16
  • Do you really see a `[1000]` as a huge array? It even fits into one RAM page... – glglgl Feb 11 '14 at 14:16
  • @YvesDaoust the length is actually of about 1 million.. – Guntram Feb 11 '14 at 14:17
  • What processor are you using? Some (DSPs in particular) have embedded 2D DMA-Copy operations, so you can refer to `source` as a matrix of N x 4, and copy from it a block of N x 3 into `dest`. – barak manos Feb 11 '14 at 14:17
  • What makes you say that your solution is slow ? Whatever else you do with these elements in your program will take longer ! –  Feb 11 '14 at 14:21
  • It's an ARM processor with 1 Ghz speed (single core) – Guntram Feb 11 '14 at 14:23
  • @GuntramGöres Copying like this will take O(N) no matter what you do, so please separate the code and test it; it is surely not the bottleneck. Even for a million elements, C++ is extremely fast. – Vikram Bhat Feb 11 '14 at 14:26
  • @GuntramGöres What is your RAM? You need 8 MB to fit that, and as you are working on an embedded system I suspect you may not have that much. – Vikram Bhat Feb 11 '14 at 14:31
  • I know that copying the entire array with memcpy (without extracting every 4th item) is a lot faster, but maybe memcpy doesn't perform a deepcopy? RAM on the device is 512 MB total – Guntram Feb 11 '14 at 14:38
  • What you are doing is a form of decimation, so perhaps you might get more insights on the http://dsp.stackexchange.com site – Peter M Feb 11 '14 at 14:49
  • This is video processing, right? If you have an array of `int` you already did something wrong, you are supposed to have an array of 8-bit entities, like `char` or `byte`. – pentadecagon Feb 11 '14 at 15:10
  • Look similar question here: http://stackoverflow.com/questions/6804101/fast-method-to-copy-memory-with-translation-argb-to-bgr/6804914#6804914 – MBo Feb 11 '14 at 15:25
  • It is video processing, that's right, and you are also right with the type, it's a char array. @MBo, that's a very interesting link, with a nice idea, thank you. Doing bit operations might also speed things up – Guntram Feb 11 '14 at 15:59

5 Answers


You could try memcpy instead of the individual assignments:

memcpy(&dest[count_small], &source[count_large], sizeof(int) * 3);
Sean Bright
  • It won't help. Compilers are smart, so optimizations like this would be done automatically. And even if not. It won't make so slow algorithm fast. – Ari Feb 11 '14 at 14:12
  • That's a function call; not sure it will improve performance, especially if you enable compiler optimization for loop-unrolling! – barak manos Feb 11 '14 at 14:13
  • I will test with `-O3` and see what it looks like. – Sean Bright Feb 11 '14 at 14:13
  • Hard-coding `12` is very bad. At least, I'd do `sizeof(source[0])*3`. – glglgl Feb 11 '14 at 14:15
  • FWIW - The `memcpy` method is ~15% faster than the 'one at a time' method with `-O3` (gcc 4.6.3) – Sean Bright Feb 11 '14 at 14:28
  • Actually I used memcpy before, when trying to copy the entire array. This was a lot faster than my loops, but I just couldn't find a way to skip every 4th byte. But I'll give that a try now and see if it speeds up the code – Guntram Feb 11 '14 at 14:41
  • I just tried it with your code, it speeds up the process nicely! With for loops, the time required shifts between 0.98 seconds and 2.28 seconds, with memcpy it remains steadily at 0.70 seconds, which would be an improvement of about 30%. Thank you! – Guntram Feb 11 '14 at 16:10
  • You're welcome. I would add that there are SIMD instructions available for ARM (as others have pointed out) that could decrease the time even more. Something to keep in mind. – Sean Bright Feb 11 '14 at 16:15

Well, the asymptotic complexity there is as good as it's going to get. You might be able to achieve slightly better performance by loading in the values as four 4-way SIMD integers, shuffling them into three 4-way SIMD integers, and writing them back out, but even that's not likely to be hugely faster.

With that said, though, the time to process 1000 elements (Edit: or one million elements) is going to be utterly trivial. If you think this is the bottleneck in your program, you are incorrect.

Sneftel

Before you do much more, try profiling your application to determine whether this is the best place to spend your time. If it is a hot spot, measure how fast it currently is and decide how fast you need it to be (or can realistically achieve). Then test the alternatives; the overhead of threading or OpenMP might even slow it down (especially, as you have now noted, on a single-core processor, where it won't help at all). For single-threaded code, I would look to memcpy as per Sean's answer.

@Sneftel's answer also covers other options involving SIMD integers.

One option would be to try parallel processing the loop, and see if that helps. You could try using the OpenMP standard (see the OpenMP article on Wikipedia), but you would have to try it for your specific situation and see if it helped. I used this recently on an AI implementation and it helped us a lot.

#pragma omp parallel for
for (int block = 0; block < 250; block++)
{
   dest[block * 3]     = source[block * 4];
   dest[block * 3 + 1] = source[block * 4 + 1];
   dest[block * 3 + 2] = source[block * 4 + 2];
}

Other than that, you are limited to the compiler's own optimisations.

You could also look at the threading support added in C++11, though you might be better off using pre-implemented framework tools like parallel_for (available in the Windows Concurrency Runtime through the PPL in Visual Studio, if that's what you're using) than rolling your own.

parallel_for(0, max_iterations,
    [&] (int i)
    {
        // ... do stuff
    }
);

Inside the for loop, you still have other options. You could try a for loop that visits every element and skips every fourth one (just skip when (i+1) % 4 == 0) instead of doing 3 copies per iteration, or do block memcpy operations for groups of 3 integers as per Sean's answer. You might get slightly different compiler optimisations for some of these, but it is unlikely (memcpy is probably as fast as you'll get).

for (int i = 0, j = 0; i < 1000; i++)
{
  if ((i+1) % 4 != 0)
  {
    dest[j] = source[i];
    j++;
  }
}

You should then develop a test rig so you can quickly performance test and decide on the best one for you. Above all, decide how much time is worth spending on this before optimising elsewhere.

Andy Brown
  • IMHO it's really unlikely that OpenMP will provide a benefit here, and as you say, the overhead might well make it significantly slower. The current code will already be saturating the memory bandwidth, and unless he's really careful with allocation and segmenting, using multiple threads will introduce false sharing penalties. – Sneftel Feb 11 '14 at 14:20
  • Agreed - but if (and only if) this is an identified bottleneck that needs addressing for performance, then it would be worth quickly trying out different options and perf testing them – Andy Brown Feb 11 '14 at 14:23
  • I'd hope this solution works, I'll try it as well! Unfortunately, the processor running it is a single core processor, so I'm not sure multi threading works on this device – Guntram Feb 11 '14 at 14:25
  • If you only have a single core then forget multithreading, it will only slow you down. – Andy Brown Feb 11 '14 at 14:36
  • While the `(i+1) % 4 != 0` code looked quite nice, unfortunately it didn't speed things up, speed remained steadily at about 2.09 seconds to 2.27 seconds. Thank you anyway :) – Guntram Feb 11 '14 at 16:43
  • I don't think you'll get better than `memcpy`, and you can't parallelise if you only have a single cpu core. – Andy Brown Feb 11 '14 at 17:20

Is your array size only 1000? If so, how is it slow? It should be done in no time! As long as you are creating a new array, for a single-threaded application this is the only way AFAIK.

However, if the datasets are huge, you could try a multi threaded application.

Also you could explore having a bigger data type holding the value, such that the array size decreases... That is if this is viable to your real life application.

Roy Samuel

If you have an Nvidia card you can consider using CUDA. If that's not the case, you can try other parallel programming methods/environments as well.

Fekete Ferenc