32

Let's say we have an array of ints like this:

const int size = 100000;
int array[size];
//set some items to 0 and other items to 1

I'd like to replace all items that have value of 1 with another value, for example 123456. This can be trivially implemented with:

for(int i = 0; i < size ; i++){
    if(array[i] != 0) 
        array[i] = 123456;
}

Out of curiosity, is there a faster way to do this, by some kind of x86 trickery, or is this the best code for the processor?

Werner Henze
Axarydax
  • This isn't what you are looking for, but `*(array+i)` can reduce runtime a bit. Good question though! – Suvarna Pattayil Apr 26 '13 at 07:42
  • You could partition the array and distribute the partitions to separate threads - but you need to consider thread creation overhead – Andreas Fester Apr 26 '13 at 07:42
  • In C that's pretty much it, but you could of course investigate ways to parallelize it at the lowest level by using multimedia instructions. – unwind Apr 26 '13 at 07:45
  • @SuvP: This depends on the compiler and platform used and is not generally true. – Werner Henze Apr 26 '13 at 08:20
  • @Werner, thanks for that, I wasn't aware that it is dependent on the OS and implementation. I thought that was how all arrays are translated internally – Suvarna Pattayil Apr 26 '13 at 08:28
  • @SuvP if that's how arrays are translated internally, then why does it matter which way you write it? – harold Apr 26 '13 at 08:39
  • @SuvP modern processors can do addressing with offset given by a second register and that is how any decent compiler nowadays implements `array[i]`. There is usually no explicit arithmetic taking place. – Jens Gustedt Apr 26 '13 at 08:48
  • Have you confirmed (by profiling or otherwise) that the trivial implementation is an actual performance bottleneck in your particular case? (I'll admit; I've been working on relatively performance-critical code recently, where optimizations like this actually could potentially make a real difference. Pretty much the very first thing I did was write a small tool that allowed me to benchmark various alternative implementations. Running that through a profiler pointed me to parts of the code I didn't even imagine at first needed optimization, and by trivial changes I improved performance ~30%.) – user Apr 26 '13 at 09:04
  • @harold Well, I guess I was under [this](http://stackoverflow.com/questions/2305770/efficiency-arrays-vs-pointers) assumption in the context of modern compilers too – Suvarna Pattayil Apr 26 '13 at 09:06
  • [Fixed link] If the number of 1 elements is relatively small, you might want to scan for them as per Tomas's REPE SCASD example at http://stackoverflow.com/questions/1493936 – Tony Delroy Apr 26 '13 at 10:26
  • @SuvP: A direct pointer lookup is faster than an array lookup, because *p is just a dereference and arr[1] requires an offset from arr, and *then* a dereference. *(arr + 1) isn't faster, you're just moving the offset calculation. As JensGustedt said, modern processors have hardware support for this very common action, making that explanation a massive oversimplification. In any case, don't do *(arr+offset) unless you have good reason, because if it was faster that's what the compiler would be rewriting your array lookup to. – Phoshi Apr 26 '13 at 12:57
  • You also could set the values to 123456 instead of 1 in the first place... – Sven Apr 27 '13 at 10:23
  • A comment on a recent benchmarking test on an Intel i7 @ 3.7 GHz: Abhinav ~154-231 us, Nicu ~170-293 us. Others come in a bit slower. Looks like one of the least upvoted answers is truly the best. – Paul Childs Mar 06 '19 at 01:12

8 Answers

47

For your specific case where you initially have 0 and 1, the following might be faster. You'll have to benchmark it. You probably can't do much better with plain C, though; you may need to dive into assembly if you want to take advantage of any "x86 trickery" that may exist.

for(int i = 0; i < size ; i++){
  array[i] *= 123456;
}

EDIT:

Benchmark code:

#include <time.h>
#include <stdlib.h>
#include <stdio.h>

size_t diff(struct timespec *start, struct timespec *end)
{
  return (end->tv_sec - start->tv_sec)*1000000000 + end->tv_nsec - start->tv_nsec;
}

int main(void)
{
  const size_t size = 1000000;
  int array[size];

  for(size_t i=0; i<size; ++i) {
    array[i] = rand() & 1;
  }

  struct timespec start, stop;

  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
  for(size_t i=0; i<size; ++i) {
    array[i] *= 123456;
    //if(array[i]) array[i] = 123456;
  }
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &stop);

  printf("size: %zu\t nsec: %09zu\n", size, diff(&start, &stop));
}

My results:

Computer: quad core AMD Phenom @2.5GHz, Linux, GCC 4.7, compiled with

$ gcc arr.c -std=gnu99 -lrt -O3 -march=native
  • if version: ~5-10ms
  • *= version: ~1.3ms
ajay
Nicu Stiurca
  • +1 for the benchmarking advice - this is a "must have" for all kinds of performance related issues – Andreas Fester Apr 26 '13 at 07:43
  • Hem, no, multiplication is *way* slower than comparison + addition. So I guess you're not saving time with this. Sure it's shorter, but it will be slower. – Gui13 Apr 26 '13 at 07:44
  • + add openMP: `#pragma omp parallel for …` – Eddy_Em Apr 26 '13 at 07:47
  • @xgbi integer mult by 1 or 0 is quite fast actually, and the inner loop is branch free. Still, it would need to be benchmarked. – Nicu Stiurca Apr 26 '13 at 07:48
  • I've considered this, it is almost as fast as the original – Axarydax Apr 26 '13 at 08:11
  • @Axarydax on my machine, the branchless `*=` is almost an order of magnitude faster... – Nicu Stiurca Apr 26 '13 at 08:22
  • @Axarydax is there a particular pattern to the 0's and 1's? AFAIK the only way the original would be about as fast is if that branch is correctly predicted a lot. – harold Apr 26 '13 at 08:25
  • I'll be damned, it really depends on the data. It looks like I have to use real data and benchmark both solutions. – Axarydax Apr 26 '13 at 08:40
  • I think if you use `if`, you basically fail in [branch prediction](http://stackoverflow.com/a/11227902/1386111), that's why multiplication will be faster. – Alvin Wong Apr 26 '13 at 08:52
  • `-array[i] & 123456` might be slightly faster. – Aki Suihkonen Apr 26 '13 at 10:57
  • @AlvinWong, agreed. If the list happened to be sorted, the branches would probably be faster or near the same. Please see: http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array – user606723 Apr 26 '13 at 11:01
  • It's better to use just `array[i] = array[i] ? 123456 : 0;` in the loop body. The compiler has no way to know that the array contains only ones and zeros and therefore cannot eliminate the multiplication. But the idea is right, this code is limited by the branch predictor and eliminating the conditional branch is the key. – Mackie Messer Apr 26 '13 at 12:19
  • @MackieMesser AFAIK (though I won't rule out the possibility) the x86 does not have a conditional operator as a machine code instruction, so `a = x ? y : z;` is really little more than syntactic sugar for `if(x) { a = y; } else { a = z; }`. Which means you still have a conditional branch, it just isn't *quite* as obvious looking at the source code. The major difference between the two constructs is how they are treated by the compiler. – user Apr 29 '13 at 07:30
  • Though it is settled, I was still a bit restless as to how a bit operation could be slower, so I benchmarked it. If multiplication did the job in 1X time, bit-wise shift and subtraction did it in 2X time, and if-tests did it in almost 4X time. Winner is multiplication!! – Abhinav Apr 29 '13 at 18:38
  • @MichaelKjörling x86 has CMOVcc and SETcc instructions. A decent compiler will eliminate the conditional jump using these instructions for both code snippets in your comment but not for the original code. The difference lies in the else branch that the original lacks i.e. the original only writes to the array if the value is not equal to zero. The code snippets in your comment always write an array element in each loop iteration. The write to the array is no longer conditional, only the value written to the array changes. This makes it possible to eliminate the conditional branch. – Mackie Messer Apr 29 '13 at 21:05
  • @MackieMesser My point was to illustrate how a conditional-operator assignment can be seen in terms of an if/else block. I'm not sure about CMOVcc and SETcc but will take your word for it for now. :) – user Apr 30 '13 at 11:02
  • I would strongly advise against ever using rand when benchmarking. – Paul Childs Mar 06 '19 at 00:44
  • @PaulChilds I think never using rand in benchmarks is a bit strong. With no idea what OP's real data looks like, rand ended up a good enough guess to [illustrate that the actual data matters](https://stackoverflow.com/questions/16231110/fast-way-to-replace-elements-in-array-c/16231182?noredirect=1#comment23218904_16231182). Previously, it seems [OP used a naive data pattern which produced misleading results.](https://stackoverflow.com/questions/16231110/fast-way-to-replace-elements-in-array-c/16231182?noredirect=1#comment23218499_16231182). – Nicu Stiurca Mar 08 '19 at 10:54
  • If your concern is regarding unpredictability and/or unreliability of the results, note that my benchmark data is in fact fully deterministic since the random seed is left to its (constant) default value, and thus fully reproducible. – Nicu Stiurca Mar 08 '19 at 10:55
15

For a small array such as yours, it's no use trying to find another algorithm, and if the values are not in a specific pattern, a simple loop is the only way to do it anyway.

However, if you have a very large array (we're talking several million entries), then you can split the work into threads. Each separate thread handles a smaller portion of the whole data set.
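A minimal sketch of that partitioning idea with POSIX threads (the thread count, struct, and function names here are illustrative assumptions, not part of the answer):

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4  /* illustrative; tune to the machine */

struct chunk { int *data; size_t len; };

/* Worker: runs the original replacement loop over one contiguous slice. */
static void *replace_chunk(void *arg)
{
    struct chunk *c = arg;
    for (size_t i = 0; i < c->len; i++)
        if (c->data[i] != 0)
            c->data[i] = 123456;
    return NULL;
}

/* Split the array into NTHREADS contiguous chunks, process them in parallel. */
void replace_parallel(int *array, size_t size)
{
    pthread_t threads[NTHREADS];
    struct chunk chunks[NTHREADS];
    size_t base = size / NTHREADS, rem = size % NTHREADS, off = 0;

    for (size_t t = 0; t < NTHREADS; t++) {
        chunks[t].data = array + off;
        chunks[t].len  = base + (t < rem ? 1 : 0);
        off += chunks[t].len;
        pthread_create(&threads[t], NULL, replace_chunk, &chunks[t]);
    }
    for (size_t t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);
}
```

Whether this wins depends on the array actually being large; for the question's 100000 ints, thread creation and synchronization can easily cost more than the loop itself.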

Some programmer dude
13

You might want to benchmark this as well:

for(int i = 0; i < size ; i++){
  array[i] = (~(array[i]-1) & 123456);
}

I ran it through the same benchmark as SchighSchagh, with little or no difference on my setup. It may differ on yours, however.

EDIT: Stop the presses!

I just remembered that x86 can "unbranch" ternary operators if the arguments around the ":" are constants. Consider the following code:

for(size_t i=0; i<size; ++i) {
    array[i] = array[i] ? 123456 : 0;
}

Looks almost like your original code, doesn't it? Well, disassembly shows that it has been compiled without any branches:

  for(size_t i=0; i<size; ++i) {
00E3104C  xor         eax,eax  
00E3104E  mov         edi,edi  
        array[i] = array[i] ? 123456 : 0;
00E31050  mov         edx,dword ptr [esi+eax*4]  
00E31053  neg         edx  
00E31055  sbb         edx,edx  
00E31057  and         edx,1E240h  
00E3105D  mov         dword ptr [esi+eax*4],edx  
00E31060  inc         eax  
00E31061  cmp         eax,5F5E100h  
00E31066  jb          wmain+50h (0E31050h)  
    }

Performance-wise it seems on par with, or a little better than, my original and SchighSchagh's solutions. It is more readable and more flexible, though. For instance, it can work with array[i] having values other than 0 and 1.

Bottom line, benchmark AND take a peek in the disassembly.

gwiazdorrr
7

The array is small enough that it fits in cache, so it should be worthwhile to use SIMD: (not tested)

  mov ecx, size
  lea esi, [array + ecx * 4]
  neg ecx
  pxor xmm0, xmm0
  movdqa xmm1, [_vec4_123456]  ; value of { 123456, 123456, 123456, 123456 }
_replaceloop:
  movdqa xmm2, [esi + ecx * 4] ; assumes the array is 16 aligned, make that true
  add ecx, 4
  pcmpeqd xmm2, xmm0
  pandn xmm2, xmm1
  movdqa [esi + ecx * 4 - 16], xmm2
  jnz _replaceloop

Unrolling by 2 might help.

If you have SSE4.1, you can use SchighSchagh's multiplication trick with pmulld.
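For reference, the compare/andnot sequence above can also be sketched with SSE2 intrinsics, letting the compiler handle register allocation; the function name is illustrative, and unaligned loads are used so the 16-byte-alignment assumption isn't needed:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Branch-free: each 32-bit lane becomes 123456 where the element is
   nonzero and 0 where it is zero (mirrors pcmpeqd + pandn above). */
void replace_sse2(int *array, size_t size)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i repl = _mm_set1_epi32(123456);
    size_t i = 0;
    for (; i + 4 <= size; i += 4) {
        __m128i v  = _mm_loadu_si128((const __m128i *)(array + i));
        __m128i eq = _mm_cmpeq_epi32(v, zero);    /* all-ones lanes where v == 0 */
        __m128i r  = _mm_andnot_si128(eq, repl);  /* 123456 where v != 0, else 0 */
        _mm_storeu_si128((__m128i *)(array + i), r);
    }
    for (; i < size; i++)  /* scalar tail for sizes not divisible by 4 */
        if (array[i] != 0)
            array[i] = 123456;
}
```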

harold
  • sure this could help... openCL / accelerate would probably be easier to maintain / port... unrolling by 2 or more would help ... but may not be worth it... – Grady Player Apr 26 '13 at 13:03
  • @GradyPlayer: With OpenCL you have to transfer all the data to the gfx card/accelerating device which is slow, probably much slower than the original suggestion. – Callum Rogers Apr 26 '13 at 16:23
  • This operation (set item to 123456 if not 0) only needs to be done once - after that each successive operation does not change the data. – Callum Rogers Apr 26 '13 at 17:58
3

Here's some Win32 code to profile various versions of the algorithm (compiled with VS2010 Express using the default release build):-

#include <windows.h>
#include <stdlib.h>
#include <stdio.h>

const size_t
  size = 0x1D4C00;

_declspec(align(16)) int
  g_array [size];

_declspec(align(16)) int
  _vec4_123456 [] = { 123456, 123456, 123456, 123456 };

void Test (void (*fn) (size_t, int *), char *test)
{
  printf ("Executing test: %s\t", test);

  for(size_t i=0; i<size; ++i) {
    g_array[i] = rand() & 1;
  }

  LARGE_INTEGER
    start,
    end;

  QueryPerformanceCounter (&start);

  fn (size, g_array);

  QueryPerformanceCounter (&end);

  printf("size: %u\t count: %09u\n", size, (int) (end.QuadPart - start.QuadPart));
}

void Test1 (size_t size, int *array)
{
  for(size_t i=0; i<size; ++i) {
    array[i] *= 123456;
  }
}

void Test2 (size_t size, int *array)
{
  for(size_t i=0; i<size; ++i) {
    if(array[i]) array[i] = 123456;
  }
}

void Test3 (size_t array_size, int *array)
{
  __asm
  {
    mov edi,array
    mov ecx, array_size 
    lea esi, [edi + ecx * 4]
    neg ecx
    pxor xmm0, xmm0
    movdqa xmm1, [_vec4_123456]  ; value of { 123456, 123456, 123456, 123456 }
_replaceloop:
    movdqa xmm2, [esi + ecx * 4] ; assumes the array is 16 aligned, make that true
    add ecx, 4
    pcmpeqd xmm2, xmm0
    pandn xmm2, xmm1
    movdqa [esi + ecx * 4 - 16], xmm2
    jnz _replaceloop
  }
}

void Test4 (size_t array_size, int *array)
{
  array_size = array_size * 8 / 12;

  __asm
  {
        mov edi,array
        mov ecx,array_size
        lea esi,[edi+ecx*4]
                                      lea edi,[edi+ecx*4]
        neg ecx
                                      mov edx,[_vec4_123456]
        pxor xmm0,xmm0
        movdqa xmm1,[_vec4_123456]
replaceloop:
        movdqa xmm2,[esi+ecx*4]
                                      mov eax,[edi]
                                      mov ebx,[edi+4]
        movdqa xmm3,[esi+ecx*4+16]
                                      add edi,16
        add ecx,9
                                      imul eax,edx    
        pcmpeqd xmm2,xmm0
                                      imul ebx,edx
        pcmpeqd xmm3,xmm0
                                      mov [edi-16],eax
                                      mov [edi-12],ebx
        pandn xmm2,xmm1
                                      mov eax,[edi-8]
                                      mov ebx,[edi-4]
        pandn xmm3,xmm1
                                      imul eax,edx    
        movdqa [esi+ecx*4-36],xmm2
                                      imul ebx,edx
        movdqa [esi+ecx*4-20],xmm3
                                      mov [edi-8],eax
                                      mov [edi-4],ebx
        loop replaceloop
  }
}

int main()
{
    Test (Test1, "Test1 - mul");
    Test (Test2, "Test2 - branch");
    Test (Test3, "Test3 - simd");
    Test (Test4, "Test4 - simdv2");
}

It's got four tests: C using an if(), C using a multiply, harold's SIMD version and my SIMD version.

Running it many times (remember, when profiling you should average the results over several runs), there's little difference between all the versions except the branching one, which is significantly slower.

This is not very surprising, as the algorithm is doing very little work for each memory item. What this means is that the real limiting factor is the bandwidth between the CPU and the memory: the CPU is constantly waiting for the memory to catch up, even with the CPU helping by prefetching the data (IA-32 detects linear access patterns and prefetches accordingly).
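A back-of-envelope sketch of that bandwidth argument (the function name is illustrative; it assumes 4-byte ints as on IA-32):

```c
#include <stddef.h>

/* Each pass reads and writes every element once, so roughly
   2 * sizeof(int) bytes of memory traffic per element, whichever
   variant of the loop body runs. */
size_t bytes_moved_per_pass(size_t elements)
{
    return 2u * sizeof(int) * elements;  /* one read + one write per int */
}
```

For the 0x1D4C00 (1,920,000) elements above that is about 15.4 MB per pass, so a few GB/s of sustained memory bandwidth bounds every variant at roughly the same few milliseconds.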

Skizz
2

This might prove faster.

for(int i = 0; i < size ; i++){
  array[i] = ((123456 << array[i]) - 123456);
}

EDIT: Changed bitwise operation to left shift.
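A quick sanity check of the identity (it depends on the 0/1 precondition from the question; the helper name is illustrative):

```c
/* (123456 << x) - 123456 maps 0 -> 0 and 1 -> 123456.
   For any other x the result is meaningless, so this relies on the
   array containing only 0s and 1s. */
int shift_replace(int x)
{
    return (123456 << x) - 123456;
}
```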

Abhinav
  • `0 & 123456 == 0` and `1 & 123456 == 0` – Nicu Stiurca Apr 26 '13 at 07:46
  • Fix it for you : x= !!(array[i] == 0x01); y = ~x + 1; array[i] = y&123456 + ~y&array[i]; – lucasg Apr 26 '13 at 08:11
  • I can confirm, 5 years on, that this answer benchmarks as marginally the fastest on x86. I believe the difference will be even more significant on ARM and some other RISC processors, as their instruction sets have single-clock-cycle operations for performing combined bit-shift and addition. – Paul Childs Mar 06 '19 at 00:48
2

You could use another array or some other data structure to keep track of the indices of the elements you set to one, and then only visit those elements. This works best if only a few elements are set to one.
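A minimal sketch of that bookkeeping (all names are illustrative): the positions of the ones are recorded when the array is filled, so the replacement pass visits only those elements:

```c
#include <stddef.h>

/* Rewrite only the recorded positions; cost is O(number of ones)
   instead of O(size). */
void replace_tracked(int *array, const size_t *one_indices, size_t count)
{
    for (size_t i = 0; i < count; i++)
        array[one_indices[i]] = 123456;
}
```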

Sven
1

One more way to speed up the array assignment is to use C inline assembly, like below:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

const int size = 100000;
int main(void) {
  int array[size];
  int value = 1000;

  __asm__ __volatile__("cld\n\t"
          "rep stosl\n\t"
          :
          :"c"(size), "a"(value), "D"(array) /* ecx is the dword count, not size*4 bytes */
          :"memory"
         );

  printf("Array[0] : %d \n", array[0]);
  return 0;
}

This should be fast compared to a plain C loop assigning the array values. Also, the `stosl` instruction takes 4 clock cycles.

Mohanraj
  • -1 This is not the fastest way even to clear an array. (Could be the smallest.) But the problem is about initializing individual / random elements. – Aki Suihkonen Apr 26 '13 at 12:57
  • Hey man, how can you say this is slow? `stosl` stores the double word in the eax register to the destination pointer (here edi), repeated for the count specified in the ecx register. When you then access an array element, it has the value that was set in eax. That is what actually happens under the hood. – Mohanraj Apr 26 '13 at 13:36
  • The real problem is that your program doesn't address the question. The secondary problem is that your solution is in a different language than what is asked. The third issue is the claim about being the fastest solution. It's not particularly slow either. Sorry for being unclear. The -1 stays. – Aki Suihkonen Apr 26 '13 at 15:27