For two logical vectors, x and y, of length > 1E8, what is the fastest way to calculate the 2x2 cross tabulations?
I suspect the answer is to write it in C/C++, but I wonder if there is something in R that is already quite smart about this problem, as it's not uncommon.
Example code, for 300M entries (feel free to let N = 1E8 if 3E8 is too big; I chose a total size just under 2.5GB, about 2.4GB). I targeted a density of 0.02, just to make it more interesting (one could use a sparse vector, if that helps, but type conversion can take time).
set.seed(0)
N = 3E8
p = 0.02
x = sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)
y = sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)
Some obvious methods:
- table
- bigtabulate
- Simple logical operations (e.g. sum(x & y); see the sketch after this list)
- Vector multiplication (boo)
- data.table
- Some of the above, with parallel from the multicore package (or the new parallel package)
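For concreteness, here is a minimal sketch of the "simple logical operations" approach (my own illustration, not code from any answer), building all four cells from x and y as generated above:

# Four logical comparisons, four sums: one cell of the 2x2 table each
n11 <- sum(x & y)    # TRUE,  TRUE
n10 <- sum(x & !y)   # TRUE,  FALSE
n01 <- sum(!x & y)   # FALSE, TRUE
n00 <- sum(!x & !y)  # FALSE, FALSE

Each line allocates a full-length temporary logical vector, which is the memory churn I complain about below.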
I've taken a stab at the first three options (see my answer), but I feel that there must be something better and faster.
I find that table works very slowly. bigtabulate seems like overkill for a pair of logical vectors. Finally, doing the vanilla logical operations seems like a kludge, and it looks at each vector too many times (3X? 7X?), not to mention that it fills a lot of additional memory during processing, which is a massive time waster.
Vector multiplication is usually a bad idea, but when the vector is sparse, one may get an advantage out of storing it as such, and then using vector multiplication.
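To sketch why sparsity might pay off (my own illustration, assuming no NA values - see Update 2): only the TRUE/TRUE cell actually requires a joint pass over both vectors, and that is exactly the piece a sparse representation would shrink; the other three cells follow from the marginal counts.

# Vector multiplication for the joint cell only; marginals give the rest
nx  <- sum(x)                     # TRUEs in x
ny  <- sum(y)                     # TRUEs in y
n11 <- sum(x * y)                 # TRUE/TRUE: the only joint computation
n10 <- nx - n11                   # TRUE/FALSE
n01 <- ny - n11                   # FALSE/TRUE
n00 <- length(x) - nx - ny + n11  # FALSE/FALSE

The dense version above still makes a full pass for sum(x * y); a sparse representation would only touch the stored TRUE positions.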
Feel free to vary N and p, if that will demonstrate any interesting behavior of the tabulation functions. :)
Update 1. My first answer gives timings on three naive methods, which is the basis for believing table is slow. Yet, the key thing to realize is that the "logical" method is grossly inefficient. Look at what it's doing:
- 4 logical vector operations
- 4 type conversions (logical to integer or FP, for sum)
- 4 vector summations
- 8 assignments (1 for the logical operation and 1 for the summation, for each of the 4 cells)
Not only that, but it's not even compiled or parallelized. Yet, it still beats the pants off of table. Notice that bigtabulate, with an extra type conversion (1 * cbind...), still beats table.
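Since the complaint above is really about the number of passes and temporaries, here is a sketch of one more base-R possibility (an illustration only, not one of the methods listed above): collapse the pair of logicals into a single integer code so that one tabulate() call counts all four cells.

# Encode each (x, y) pair as 1..4, then count all four cells in one pass
# 1 = (F,F), 2 = (T,F), 3 = (F,T), 4 = (T,T)
counts <- tabulate(1L + x + 2L * y, nbins = 4L)
xt <- matrix(counts, nrow = 2,
             dimnames = list(x = c("FALSE", "TRUE"), y = c("FALSE", "TRUE")))

This still allocates one full-length integer vector for the codes, but the counting itself is a single compiled sweep.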
Update 2. Lest anyone point out that logical vectors in R support NA, and that this will be a wrench in the system for these cross tabulations (which is true in most cases), I should point out that my vectors come from is.na() or is.finite(). :) I've been debugging NA and other non-finite values - they've been a headache for me recently. If you don't know whether any of your entries are NA, you could test with any(is.na(yourVector)) - this would be wise before you adopt some of the ideas arising in this Q&A.
Update 3. Brandon Bertelsen asked a very reasonable question in the comments: why use so much data when a sub-sample (the initial set, after all, is a sample ;-)) might be adequate for the purposes of creating a cross-tabulation? Not to drift too far into statistics, but the data arises from cases where the TRUE observations are very rare, for both variables. One is a result of a data anomaly, the other due to a possible bug in code (possible because we only see the computational result - think of variable x as "Garbage In" and y as "Garbage Out"). As a result, the question is whether the issues in the output caused by the code occur solely in those cases where the data is anomalous, or whether there are other instances where good data goes bad. (This is why I asked a question about stopping when a NaN, NA, or Inf is encountered.)
That also explains why my example has a low probability for TRUE values; these really occur much less than 0.1% of the time.
Does this suggest a different solution path? Yes: it suggests that we may use two indices (i.e. the locations of TRUE in each set) and count set intersections. I avoided set intersections because I was burned a while back by Matlab (yes, this is R, but bear with me), which would first sort the elements of a set before computing an intersection. (I vaguely recall the complexity was even more embarrassing: something like O(n^2) instead of O(n log n).)
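To make the index idea concrete, here is a sketch (again assuming no NA values, per Update 2); note that base R's intersect() is built on match(), which hashes rather than sorts:

# Work only with the (rare) TRUE positions, roughly p * N indices each
ix  <- which(x)
iy  <- which(y)
n11 <- length(intersect(ix, iy))  # TRUE,  TRUE
n10 <- length(ix) - n11           # TRUE,  FALSE
n01 <- length(iy) - n11           # FALSE, TRUE
n00 <- N - (n10 + n01 + n11)      # FALSE, FALSE

For the TRUE/TRUE count alone, sum(x[iy]) would do the same job without computing the intersection explicitly.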