
For two logical vectors, x and y, of length > 1E8, what is the fastest way to calculate the 2x2 cross tabulations?

I suspect the answer is to write it in C/C++, but I wonder if there is something in R that is already quite smart about this problem, as it's not uncommon.

Example code, for 300M entries (feel free to let N = 1E8 if 3E8 is too big); I chose a total size just under 2.5GB (2.4GB). I targeted a density of 0.02, just to make it more interesting (one could use a sparse vector, if that helps, but type conversion can take time).

set.seed(0)
N = 3E8
p = 0.02
x = sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)
y = sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)

Some obvious methods:

  1. table
  2. bigtabulate
  3. Simple logical operations (e.g. sum(x & y))
  4. Vector multiplication (boo)
  5. data.table (see the sketch after this list)
  6. Some of the above, with parallel from the multicore package (or the new parallel package)
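
Since option 5 never actually gets benchmarked below, here is a minimal, untimed sketch of what the data.table variant might look like (whether grouping two logical columns can beat the raw sums is exactly the open question):

library(data.table)
DT <- data.table(x = x, y = y)
DT[, .N, by = list(x, y)]   # row counts for each of the four TRUE/FALSE cells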

I've taken a stab at the first three options (see my answer), but I feel that there must be something better and faster.

I find that table works very slowly. bigtabulate seems like overkill for a pair of logical vectors. Finally, doing the vanilla logical operations seems like a kludge, and it looks at each vector too many times (3X? 7X?), not to mention that it fills a lot of additional memory during processing, which is a massive time waster.

Vector multiplication is usually a bad idea, but when the vector is sparse, one may get an advantage out of storing it as such, and then using vector multiplication.
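
To make that concrete, here is one hedged sketch using the Matrix package (untimed; note that the as.numeric() conversion creates a large dense temporary, which is exactly the type-conversion cost mentioned above):

library(Matrix)
xs <- Matrix(as.numeric(x), ncol = 1, sparse = TRUE)  # sparse N x 1 column
ys <- Matrix(as.numeric(y), ncol = 1, sparse = TRUE)
n11 <- crossprod(xs, ys)[1, 1]   # TRUE/TRUE cell via "vector multiplication"
n1x <- sum(xs)
n1y <- sum(ys)
c(n11, n1x - n11, n1y - n11, length(x) - n1x - n1y + n11)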

Feel free to vary N and p, if that will demonstrate any interesting behavior of the tabulation functions. :)


Update 1. My first answer gives timings on three naive methods, which is the basis for believing table is slow. Yet, the key thing to realize is that the "logical" method is grossly inefficient. Look at what it's doing:

  • 4 logical vector operations
  • 4 type conversions (logical to integer or FP - for sum)
  • 4 vector summations
  • 8 assignments (1 for each logical operation, 1 for each summation)

Not only that, but it's not even compiled or parallelized. Yet, it still beats the pants off of table. Notice that bigtabulate, with an extra type conversion (1 * cbind...) still beats table.

Update 2. Lest anyone point out that logical vectors in R support NA, and that that will be a wrench in the system for these cross tabulations (which is true in most cases), I should point out that my vectors come from is.na() or is.finite(). :) I've been debugging NA and other non-finite values - they've been a headache for me recently. If you don't know whether any of your entries are NA, you could test with any(is.na(yourVector)) - this would be wise before you adopt some of the ideas arising in this Q&A.


Update 3. Brandon Bertelsen asked a very reasonable question in the comments: why use so much data when a sub-sample (the initial set, after all, is a sample ;-)) might be adequate for the purposes of creating a cross-tabulation? Not to drift too far into statistics, but the data arises from cases where the TRUE observations are very rare, for both variables. One is a result of a data anomaly, the other due to a possible bug in code (possible bug because we only see the computational result - think of variable x as "Garbage In", and y as "Garbage Out"). As a result, the question is whether the issues in the output caused by the code are solely those cases where the data is anomalous, or whether there are other instances where good data goes bad. (This is why I asked a question about stopping when a NaN, NA, or Inf is encountered.)

That also explains why my example has a low probability for TRUE values; these really occur much less than 0.1% of the time.

Does this suggest a different solution path? Yes: it suggests that we may use two indices (i.e. the locations of TRUE in each set) and count set intersections. I avoided set intersections because I was burned a while back by Matlab (yes, this is R, but bear with me), which would first sort the elements of a set before doing an intersection. (I vaguely recall the complexity was even more embarrassing: like O(n^2) instead of O(n log n).)
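
For what it's worth, %in% (via match()) in R hashes rather than sorts, so a bare-bones index sketch looks like this (essentially what func_find01 in one of the answers below does):

ix <- which(x)             # locations of TRUE in x
iy <- which(y)             # locations of TRUE in y
n11 <- sum(ix %in% iy)     # size of the intersection
c(n11, length(ix) - n11, length(iy) - n11,
  length(x) - length(ix) - length(iy) + n11)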

Iterator
  • I'm puzzled why `table` seems slow to you. It's always been quick when I used it. (Admittedly it took 5 minutes on your task.) – IRTFM Feb 07 '12 at 05:54
  • @DWin: Sorry I didn't respond earlier, I was *waiting* on `table`. :) See my results below. The results for `table` are just abysmal. It was beaten by the method of logical vectors, which is itself a very naive and very wasteful method - too many memory accesses, floating point calculations & type conversions, not parallelized, ... the horror. Yet, it is still faster than `table`. – Iterator Feb 07 '12 at 06:08
  • Yes. I was surprised too. My logical vector version was sum(x>y), sum(x – IRTFM Feb 07 '12 at 06:35
  • @DWin That's right - the 4th operation can be removed by simply subtracting from the length of the vector. Nonetheless, it's still 2 logical operations too many, given that we know we're going to do a tabulation. The simplicity of this is bedeviling. – Iterator Feb 07 '12 at 06:39
  • @DWin Well, we now have an estimated 80X speedup over `table()`. :) I now believe that it's feasible to improve at least another 10X. – Iterator Feb 07 '12 at 14:10
  • I know this is a programming question about big data... but would it be ridiculous of me to suggest making a few assumptions and sampling from the vectors and accepting a certain level of confidence rather than trying to be perfect? ie, something like `table(sample(x,3e6),sample(y,3e6))` – Brandon Bertelsen Feb 08 '12 at 14:51
  • @BrandonBertelsen That would be too clever by half, but I should address that in a revision to the question. As a teaser, though: keep in mind that for testing some hypotheses you need a larger sample. :) – Iterator Feb 08 '12 at 15:00

5 Answers


If you're doing a lot of operations on huge logical vectors, take a look at the bit package. It saves a ton of memory by storing the booleans as true 1-bit booleans.

This doesn't help with table; it actually makes it worse because there are more unique values in the bit vector due to how it's constructed. But it really helps with logical comparisons.

# N <- 3e7
require(bit)
require(rbenchmark)   # for benchmark()
# func_logical() is the four-sum cross-tab helper defined in the next answer
xb <- as.bit(x)
yb <- as.bit(y)
benchmark(replications = 1, order = "elapsed", 
    bit = {res <- func_logical(xb,yb)},
    logical = {res <- func_logical(x,y)}
)
#      test replications elapsed relative user.self sys.self user.child sys.child
# 1     bit            1   0.129  1.00000     0.132    0.000          0         0
# 2 logical            1   3.677 28.50388     2.684    0.928          0         0
Joshua Ulrich
  • Thanks! This is also the kind of serious improvement that can and should be achieved. Doing the bit-vector conversion consumes about 50% more time (versus using an already-supplied bit vector), but that's worth the penalty for the serious speedup relative to the "logical" version. – Iterator Feb 07 '12 at 13:40

Here are results for the logical method, table, and bigtabulate, for N = 3E8:

         test replications elapsed relative user.self sys.self
2     logical            1  23.861 1.000000     15.36     8.50
3 bigtabulate            1  36.477 1.528729     28.04     8.43
1       table            1 184.652 7.738653    150.61    33.99

In this case, table is a disaster.

For comparison, here is N = 3E6:

         test replications elapsed relative user.self sys.self
2     logical            1   0.220 1.000000      0.14     0.08
3 bigtabulate            1   0.534 2.427273      0.45     0.08
1       table            1   1.956 8.890909      1.87     0.09

At this point, it seems that writing one's own logical functions is best, even though that abuses sum, and examines each logical vector multiple times. I've not yet tried compiling the functions, but that should yield better results.
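
For reference, a minimal sketch of the compilation idea, using the base compiler package (I haven't timed this; gains for already-vectorized code are usually modest):

library(compiler)
func_logical_c <- cmpfun(func_logical)   # byte-compiled copy of func_logical
# system.time(res <- func_logical_c(x, y))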

Update 1 If we give bigtabulate values that are already integers, i.e. if we do the type conversion 1 * cbind(v1,v2) outside of bigtabulate, then the N=3E6 multiple is 1.80, instead of 2.4. The N=3E8 multiple relative to the "logical" method is only 1.21, instead of 1.53.


Update 2

As Joshua Ulrich has pointed out, converting to bit vectors is a significant improvement - we're allocating and moving around a LOT less data: R's logical vectors consume 4 bytes per entry ("Why?", you may ask... Well, I don't know, but an answer may turn up here.), whereas a bit vector consumes, well, one bit, per entry - i.e. 1/32 as much data. So, x consumes 1.2e9 bytes, while xb (the bit version in the code below) consumes only 3.75e7 bytes.

I've dropped table and the bigtabulate variations from the updated benchmarks (N=3e8). Note that logicalB1 assumes that the data is already a bit vector, while logicalB2 is the same operation with the penalty for type conversion. As my logical vectors are the results of operations on other data, I don't have the benefit of starting off with a bit vector. Nonetheless, the penalty to be paid is relatively small. [The "logical3" series only performs 3 logical operations, and then does a subtraction. Since it's cross-tabulation, we know the total, as DWin has remarked.]

        test replications elapsed  relative user.self sys.self
4 logical3B1            1   1.276  1.000000      1.11     0.17
2  logicalB1            1   1.768  1.385580      1.56     0.21
5 logical3B2            1   2.297  1.800157      2.15     0.14
3  logicalB2            1   2.782  2.180251      2.53     0.26
1    logical            1  22.953 17.988245     15.14     7.82

We've now sped this up to taking only 1.8-2.8 seconds, even with many gross inefficiencies. There is no doubt it should be feasible to do this in well under 1 second, with changes including one or more of: C code, compilation, and multicore processing. After all, the 3 (or 4) different logical operations could be done independently, even though that's still a waste of compute cycles.
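
As a rough, untimed sketch of the multicore idea (hypothetical func_logical3_par; this assumes a platform where parallel can fork, and copying overhead may eat much of the gain):

library(parallel)
func_logical3_par <- function(v1, v2, cores = 3){
    sums <- mclapply(list(
        function() sum(v1 & v2),
        function() sum(v1 & !v2),
        function() sum(!v1 & v2)
    ), function(f) f(), mc.cores = cores)
    r <- unlist(sums)
    c(r, length(v1) - sum(r))   # fourth cell recovered from the length
}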

The fairest challenger to compare against table - logical3B2, which starts from plain logical vectors and pays the bit-conversion cost itself - is about 80X faster than table. It's about 10X faster than the naive logical operation. And it still has a lot of room for improvement.


Here is code to produce the above. NOTE I recommend commenting out some of the operations or vectors, unless you have a lot of RAM - the creation of x, x1, and xb, along with the corresponding y objects, will take up a fair bit of memory.

Also, note: I should have used 1L as the integer multiplier for bigtabulate, instead of just 1. At some point I will re-run with this change, and would recommend that change to anyone who uses the bigtabulate approach.
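
For clarity, the recommended variant would look like this (hypothetical name; I have not re-timed it):

func_bigtabulate1L <- function(v1, v2){
    return(bigtabulate(1L * cbind(v1, v2), ccols = c(1, 2)))
}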

library(rbenchmark)
library(bigtabulate)
library(bit)

set.seed(0)
N <- 3E8
p <- 0.02

x <- sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)
y <- sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)

x1 <- 1*x
y1 <- 1*y

xb <- as.bit(x)
yb <- as.bit(y)

func_table  <- function(v1,v2){
    return(table(v1,v2))
}

func_logical  <- function(v1,v2){
    return(c(sum(v1 & v2), sum(v1 & !v2), sum(!v1 & v2), sum(!v1 & !v2)))
}

func_logicalB  <- function(v1,v2){
    v1B <- as.bit(v1)
    v2B <- as.bit(v2)
    return(c(sum(v1B & v2B), sum(v1B & !v2B), sum(!v1B & v2B), sum(!v1B & !v2B)))
}

func_bigtabulate    <- function(v1,v2){
    return(bigtabulate(1*cbind(v1,v2), ccols = c(1,2)))
}

func_bigtabulate2    <- function(v1,v2){
    return(bigtabulate(cbind(v1,v2), ccols = c(1,2)))
}

func_logical3   <- function(v1,v2){
    r1  <- sum(v1 & v2)
    r2  <- sum(v1 & !v2)
    r3  <- sum(!v1 & v2)
    r4  <- length(v1) - sum(c(r1, r2, r3))
    return(c(r1, r2, r3, r4))
}

func_logical3B   <- function(v1,v2){
    v1B <- as.bit(v1)
    v2B <- as.bit(v2)
    r1  <- sum(v1B & v2B)
    r2  <- sum(v1B & !v2B)
    r3  <- sum(!v1B & v2B)
    r4  <- length(v1) - sum(c(r1, r2, r3))
    return(c(r1, r2, r3, r4))
}

benchmark(replications = 1, order = "elapsed", 
    #table = {res <- func_table(x,y)},
    logical = {res <- func_logical(x,y)},
    logicalB1 = {res <- func_logical(xb,yb)},
    logicalB2 = {res <- func_logicalB(x,y)},

    logical3B1 = {res <- func_logical3(xb,yb)},
    logical3B2 = {res <- func_logical3B(x,y)}

    #bigtabulate = {res <- func_bigtabulate(x,y)},
    #bigtabulate2 = {res <- func_bigtabulate2(x1,y1)}
)
Iterator
  • +1 Very interesting. FWIW, `tabulate()` turns out to be **much** faster than (and is in fact called by) `table()`. `tabulate(1 + 1L*x + 2L*y)` is competitive with, but still 5-10% slower than, your `func_logical()`. – Josh O'Brien Feb 07 '12 at 06:35
  • The mapping to integers is another trick I didn't yet try. After a certain point, it becomes painful to wait so long, and I was nervous about the pedigree of `table`. :) This entire tabulation should really take less than a second on a modern processor. Also, as DWin pointed out, the `func_logical()` is illustrative - the 4th summation can be dropped and we can just subtract the total of the other 3 values from the length of the vectors. – Iterator Feb 07 '12 at 06:43
  • @JoshO'Brien I just ran your suggestion. See my second answer, which is primarily about going after the indices, rather than working with the vectors themselves. The speed is good, but the best methods are now >10X faster. :) I still think there's a 10X improvement feasible here, but 1 second is "almost" interactive. – Iterator Feb 08 '12 at 16:01

Here is an answer using Rcpp sugar.

N <- 1e8
x <- sample(c(T,F),N,replace=T)
y <- sample(c(T,F),N,replace=T)

func_logical  <- function(v1,v2){
    return(c(sum(v1 & v2), sum(v1 & !v2), sum(!v1 & v2), sum(!v1 & !v2)))
}


library(Rcpp)
library(inline)

doCrossTab1 <- cxxfunction(signature(x="integer", y = "integer"), body='
  Rcpp::LogicalVector Vx(x);
  Rcpp::LogicalVector Vy(y);
  Rcpp::IntegerVector V(4);

  V[0] = sum(Vx*Vy);
  V[1] = sum(Vx*!Vy);
  V[2] = sum(!Vx*Vy);
  V[3] = sum(!Vx*!Vy);
  return( wrap(V));
  '
, plugin="Rcpp")

system.time(doCrossTab1(x,y))

require(bit)
system.time(
{
xb <- as.bit(x)
yb <- as.bit(y)
func_logical(xb,yb)
})

which results in:

> system.time(doCrossTab1(x,y))
   user  system elapsed 
  1.067   0.002   1.069 
> system.time(
+ {
+ xb <- as.bit(x)
+ yb <- as.bit(y)
+ func_logical(xb,yb)
+ })
   user  system elapsed 
  1.451   0.001   1.453 

So, we can get a little speedup over the bit package, though I'm surprised at how competitive the times are.

Update: In honor of Iterator, here is an Rcpp iterator solution:

doCrossTab2 <- cxxfunction(signature(x="integer", y = "integer"), body='
  Rcpp::LogicalVector Vx(x);
  Rcpp::LogicalVector Vy(y);
  Rcpp::IntegerVector V(4);
    V[0]=V[1]=V[2]=V[3]=0;
  LogicalVector::iterator itx = Vx.begin();
  LogicalVector::iterator ity = Vy.begin();
  while(itx!=Vx.end()){
    V[0] += (*itx)*(*ity);
    V[1] += (*itx)*(!*ity);
    V[2] += (!*itx)*(*ity);
    V[3] += (!*itx)*(!*ity);    
    itx++;
    ity++;
  }
  return( wrap(V));
  '
, plugin="Rcpp")

system.time(doCrossTab2(x,y))
#   user  system elapsed 
#  0.780   0.001   0.782 
Ian Fellows
  • (Deleted prior comments - misread timings.) These are quite interesting #s. Rcpp sugar certainly looks helpful; I tinkered with it earlier to see if I could improve via logical operations (I presume that the `*` operation incurs costs for integer type conversion + integer multiplication), but didn't have much bandwidth for this today. I'm curious if Dirk or Romain might pop by with some insights on improving this further. – Iterator Feb 14 '12 at 03:50
  • Your iterator solution is quite appealing and instructive. At this point I have to wonder what could possibly beat an iterator. ;-) I look forward to tinkering with it. Depending on how smart the compiler is, it could be possible to improve the timing by avoiding some of the calculations, on average (i.e. by making use of the fact that the sample is very sparse, so only update 1 entry when FALSE-FALSE is encountered), and by dropping `V[3]`, which can be determined from the other values + the length. – Iterator Feb 14 '12 at 06:35

A different tactic is to consider just set intersections, using the indices of the TRUE values and taking advantage of the fact that the samples are very biased (i.e. mostly FALSE).

To that end, I introduce func_find01 and a translation that uses the bit package (func_find01B); all of the code that doesn't appear in the answer above is pasted below.

I re-ran the full N=3e8 evaluation, except I forgot to include func_find01B; I then re-ran the faster methods against it in a second pass.

          test replications elapsed   relative user.self sys.self
6   logical3B1            1   1.298   1.000000      1.13     0.17
4    logicalB1            1   1.805   1.390601      1.57     0.23
7   logical3B2            1   2.317   1.785054      2.12     0.20
5    logicalB2            1   2.820   2.172573      2.53     0.29
2       find01            1   6.125   4.718798      4.24     1.88
9 bigtabulate2            1  22.823  17.583205     21.00     1.81
3      logical            1  23.800  18.335901     15.51     8.28
8  bigtabulate            1  27.674  21.320493     24.27     3.40
1        table            1 183.467 141.345917    149.01    34.41

Just the "fast" methods:

        test replications elapsed relative user.self sys.self
3     find02            1   1.078 1.000000      1.03     0.04
6 logical3B1            1   1.312 1.217069      1.18     0.13
4  logicalB1            1   1.797 1.666976      1.58     0.22
2    find01B            1   2.104 1.951763      2.03     0.08
7 logical3B2            1   2.319 2.151206      2.13     0.19
5  logicalB2            1   2.817 2.613173      2.50     0.31
1     find01            1   6.143 5.698516      4.21     1.93

So, find01B is fastest among methods that do not use pre-converted bit vectors, by a slim margin (2.099 seconds versus 2.327 seconds). Where did find02 come from? I subsequently wrote a version that uses pre-computed bit vectors. This is now the fastest.

In general, the running time of the "indices method" approach may be affected by the marginal & joint probabilities. I suspect that it would be especially competitive when the probabilities are even lower, but one has to know that a priori, or via a sub-sample.
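
One hedged way to automate that decision (hypothetical helper; the 1e-3 cutoff is a guess, not tuned) is to estimate the densities from a small random slice and dispatch to the index method only when both vectors look sufficiently sparse, falling back to func_logical3 from my other answer otherwise:

choose_crosstab <- function(v1, v2, n_probe = 1e5, cutoff = 1e-3){
    idx   <- sample.int(length(v1), n_probe)
    p_hat <- max(mean(v1[idx]), mean(v2[idx]))   # rough density estimate
    if (p_hat < cutoff) func_find01(v1, v2) else func_logical3(v1, v2)
}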


Update 1. I've also timed Josh O'Brien's suggestion, using tabulate() instead of table(). The results, at about 12 seconds elapsed, take roughly 2X as long as find01 and about half as long as bigtabulate2. Now that the best methods are approaching 1 second, this is also relatively slow:

 user  system elapsed 
7.670   5.140  12.815 

Code:

func_find01 <- function(v1, v2){
    ix1 <- which(v1 == TRUE)
    ix2 <- which(v2 == TRUE)

    len_ixJ <- sum(ix1 %in% ix2)
    len1    <- length(ix1)
    len2    <- length(ix2)
    return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
             length(v1) - len1 - len2 + len_ixJ))
}

func_find01B <- function(v1, v2){
    v1b = as.bit(v1)
    v2b = as.bit(v2)

    len_ixJ <- sum(v1b & v2b)
    len1 <- sum(v1b)
    len2 <- sum(v2b)

    return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
             length(v1) - len1 - len2 + len_ixJ))
}

func_find02 <- function(v1b, v2b){
    len_ixJ <- sum(v1b & v2b)
    len1 <- sum(v1b)
    len2 <- sum(v2b)

    return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
             length(v1b) - len1 - len2 + len_ixJ))
}

func_bigtabulate2    <- function(v1,v2){
    return(bigtabulate(cbind(v1,v2), ccols = c(1,2)))
}

func_tabulate01 <- function(v1,v2){
    return(tabulate(1L + 1L*v1 + 2L*v2))
}

benchmark(replications = 1, order = "elapsed", 
    table = {res <- func_table(x,y)},
    find01  = {res <- func_find01(x,y)},
    find01B  = {res <- func_find01B(x,y)},
    find02  = {res <- func_find01B(xb,yb)},
    logical = {res <- func_logical(x,y)},
    logicalB1 = {res <- func_logical(xb,yb)},
    logicalB2 = {res <- func_logicalB(x,y)},

    logical3B1 = {res <- func_logical3(xb,yb)},
    logical3B2 = {res <- func_logical3B(x,y)},

    tabulate    = {res <- func_tabulate01(x,y)},
    bigtabulate = {res <- func_bigtabulate(x,y)},
    bigtabulate2 = {res <- func_bigtabulate2(x1,y1)}
)
Iterator

Here's an answer with Rcpp, tabulating only those entries that are not both 0. I suspect there must be several ways to improve this, as this is unusually slow; it's also my first attempt with Rcpp, so there may be some obvious inefficiencies associated with moving the data around. I wrote an example that is purposefully plain vanilla, which should let others demonstrate how this can be improved.

library(Rcpp)
library(inline)
doCrossTab <- cxxfunction(signature(x="integer", y = "integer"), body='
  Rcpp::IntegerVector Vx(x);
  Rcpp::IntegerVector Vy(y);
  Rcpp::IntegerVector V(3);
  for(int i = 0; i < Vx.length(); i++) {
    if( (Vx(i) == 1) & ( Vy(i) == 1) ){ V[0]++; } 
    else if( (Vx(i) == 1) & ( Vy(i) == 0) ){ V[1]++; } 
    else if( (Vx(i) == 0) & ( Vy(i) == 1) ){ V[2]++; } 
 }
  return( wrap(V));
  ', plugin="Rcpp")

Timing results for N = 3E8:

   user  system elapsed 
 10.930   1.620  12.586 

This takes more than 6X as long as func_find01B in my 2nd answer.
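
In the spirit of the comments on the Rcpp sugar answer, here is one untimed sketch of a possible improvement (hypothetical doCrossTab3): branch only when at least one entry is TRUE, which is rare at p = 0.02, and recover the fourth cell from the length afterwards.

doCrossTab3 <- cxxfunction(signature(x="integer", y = "integer"), body='
  Rcpp::LogicalVector Vx(x);
  Rcpp::LogicalVector Vy(y);
  Rcpp::IntegerVector V(4);
  int n = Vx.size();
  for(int i = 0; i < n; i++) {
    if( Vx[i] || Vy[i] ){              // rare branch: at least one TRUE
      if( Vx[i] && Vy[i] )      V[0]++;
      else if( Vx[i] )          V[1]++;
      else                      V[2]++;
    }
  }
  V[3] = n - V[0] - V[1] - V[2];
  return( wrap(V) );
  ', plugin="Rcpp")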

Iterator