I'm an R developer who uses C for algorithmic purposes and have a question about why a C loop that seems like it would be slow is actually faster than the alternative approach.
In R, our Boolean type can actually have three values, true
, false
, and na
, and we represent this using an int
at the C level.
I'm looking into a vectorized &&
operation (yes, we have this in R already, but bear with me) that also handles the na
case. The scalar results would look like this:
F && F == F
F && T == F
F && N == F
T && F == F
T && T == T
T && N == N
N && F == F
N && T == N
N && N == N
Note that it works like &&
in C, except that na
values propagate when combined with anything except false
, in which case we "know" that &&
can never be true, so we return false
.
Now to the implementation. Assume we have two vectors, v_out
and v_x
, and we'd like to perform the vectorized &&
on them. We are allowed to overwrite v_out
with the result. One option is:
// Option 1
for (int i = 0; i < size; ++i) {
int elt_out = v_out[i];
int elt_x = v_x[i];
if (elt_out == 0) {
// Done
} else if (elt_x == 0) {
v_out[i] = 0;
} else if (elt_out == na) {
// Done
} else if (elt_x == na) {
v_out[i] = na;
}
}
And another option is:
// Option 2
for (int i = 0; i < size; ++i) {
int elt_out = v_out[i];
if (elt_out == 0) {
continue;
}
int elt_x = v_x[i];
if (elt_x == 0) {
v_out[i] = 0;
} else if (elt_out == na) {
// Done
} else if (elt_x == na) {
v_out[i] = na;
}
}
I sort of expected the second option to be faster, because it avoids accessing v_x[i]
when it isn't required. But in fact it was twice as slow when compiled with -O2
!
In the following script, I get the following timing results. Note that I am on a Mac and compile with Clang.
It seems reasonable with O0. They are about the same.
2x faster with O2 with Option 1!
Option 1, `clang -O0`
0.110560
Option 2, `clang -O0`
0.107710
Option 1, `clang -O2`
0.032223
Option 2, `clang -O2`
0.070557
What is going on here? My best guess is that it has something to do with the fact that in Option 1 v_x[i]
is always being accessed linearly, which is extremely fast. But in Option 2, v_x[i]
is essentially being accessed randomly (sort of), because it might access v_x[10]
, but then not need another element from v_x
until v_x[120]
, and because that access isn't linear, it is probably much slower.
Reproducible script:
#include <stdlib.h>
#include <stdio.h>
#include <limits.h>
#include <time.h>
int main() {
srand(123);
int size = 1e7;
int na = INT_MIN;
int* v_out = (int*) malloc(size * sizeof(int));
int* v_x = (int*) malloc(size * sizeof(int));
// Generate random numbers between 1-3
// 1 -> false
// 2 -> true
// 3 -> na
for (int i = 0; i < size; ++i) {
int elt_out = rand() % 3 + 1;
if (elt_out == 1) {
v_out[i] = 0;
} else if (elt_out == 2) {
v_out[i] = 1;
} else {
v_out[i] = na;
}
int elt_x = rand() % 3 + 1;
if (elt_x == 1) {
v_x[i] = 0;
} else if (elt_x == 2) {
v_x[i] = 1;
} else {
v_x[i] = na;
}
}
clock_t start = clock();
// Option 1
for (int i = 0; i < size; ++i) {
int elt_out = v_out[i];
int elt_x = v_x[i];
if (elt_out == 0) {
// Done
} else if (elt_x == 0) {
v_out[i] = 0;
} else if (elt_out == na) {
// Done
} else if (elt_x == na) {
v_out[i] = na;
}
}
// // Option 2
// for (int i = 0; i < size; ++i) {
// int elt_out = v_out[i];
//
// if (elt_out == 0) {
// continue;
// }
//
// int elt_x = v_x[i];
//
// if (elt_x == 0) {
// v_out[i] = 0;
// } else if (elt_out == na) {
// // Done
// } else if (elt_x == na) {
// v_out[i] = na;
// }
// }
clock_t end = clock();
double time = (double) (end - start) / CLOCKS_PER_SEC;
free(v_out);
free(v_x);
printf("%f\n", time);
return 0;
}
Based on a few questions in the comments, here are a few points of clarifications for future readers:
I am on a 2018 15-inch MacBook Pro with a 2.9 GHz 6-core Intel i9-8950HK (6 core Coffee Lake.)
My particular Clang version that I tested with is
Apple clang version 13.1.6 (clang-1316.0.21.2.5)
withTarget: x86_64-apple-darwin21.6.0
I am restricted by R to use
int
as the data type (even though there are more efficient options) and the following coding:false = 0
,true = 1
,na = INT_MIN
. The reproducible example that I provided respects this.The original question was not actually a request to make the code run faster. I just wanted to get an idea of what the difference was between my two if/else approaches. That said, some answers have shown that branchless approaches can be much faster, and I really appreciate the explanations that those users have provided! That has greatly influenced the final version of the implementation I am working on.