
I have been using a small `tab` function for some time, which shows the frequency, percent, and cumulative percent for a vector. The output looks like this:

          Freq    Percent        cum
ARSON      462 0.01988893 0.01988893
BURGLARY 22767 0.98011107 1.00000000
         23229 1.00000000         NA

The excellent dplyr package motivated me to update the function, and now I am wondering how to make the updated version even faster. Here is the old function:

tab = function(x, useNA = FALSE) {
  # one row per unique non-NA value, plus a totals row (and an NA row if requested)
  k = length(unique(x[!is.na(x)])) + 1
  if (useNA) k = k + 1
  tab = array(NA, c(k, 3))
  colnames(tab) = c("freq.", "prob.", "cum.")
  useNA = ifelse(useNA, "always", "no")
  rownames(tab) = names(c(table(x, useNA = useNA), ""))

  tab[-nrow(tab), 1] = table(x, useNA = useNA)
  tab[-nrow(tab), 2] = prop.table(table(x, useNA = useNA))
  tab[, 3] = cumsum(tab[, 2])
  # totals row: column sums of freq. and prob. (a single data row is copied as-is)
  if (k > 2)  tab[nrow(tab), -3] = colSums(tab[-nrow(tab), -3])
  if (k == 2) tab[nrow(tab), -3] = tab[-nrow(tab), -3]

  tab
}
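
For reference, calling it on a vector like the one above gives the same numbers, with the columns labelled freq., prob., and cum.:

x1 = c(rep('ARSON', 462), rep('BURGLARY', 22767))
tab(x1)
#          freq.      prob.       cum.
# ARSON      462 0.01988893 0.01988893
# BURGLARY 22767 0.98011107 1.00000000
#          23229 1.00000000         NA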

and here is the new one, based on dplyr:

tab2 = function(x, useNA = FALSE) {
    if (!useNA && any(is.na(x))) x = na.omit(x)
    n = length(x)
    # %.% was dplyr's chaining operator at the time (replaced by %>% in later versions)
    out = data.frame(x, 1) %.%
        group_by(x) %.%
        dplyr::summarise(
            Freq    = length(X1),    # X1 is the dummy column created by data.frame(x, 1)
            Percent = Freq / n
        ) %.%
        dplyr::arrange(x)
    ids = as.character(out$x)
    ids[is.na(ids)] = '<NA>'
    out = select(out, Freq, Percent)
    out$cum = cumsum(out$Percent)
    class(out) = "data.frame"        # drop dplyr's tbl classes, keep a plain data.frame
    out = rbind(out, c(n, 1, NA))    # totals row
    rownames(out) = c(ids, '')
    out
}
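
Applied to the first benchmark vector, tab2 reproduces the output shown at the top:

x1 = c(rep('ARSON', 462), rep('BURGLARY', 22767))
tab2(x1)
#           Freq    Percent        cum
# ARSON      462 0.01988893 0.01988893
# BURGLARY 22767 0.98011107 1.00000000
#          23229 1.00000000         NA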

Finally, some performance benchmarks:

x1 = c(rep('ARSON',462),rep('BURGLARY',22767))
x2 = c(rep('ARSON',462),rep('BURGLARY',22767),rep(NA,100))
x3 = c(c(1:10),c(1:10),1,4)
x4 = c(rep(c(1:100),500),rep(c(1:50),20),1,4)

library('rbenchmark')

benchmark(tab(x1), tab2(x1), replications=100)[,c('test','elapsed','relative')]
#       test elapsed relative
# 1  tab(x1)   1.412    2.307
# 2 tab2(x1)   0.612    1.000

benchmark(tab(x2),tab2(x2), replications=100)[,c('test','elapsed','relative')]
#       test elapsed relative
# 1  tab(x2)   1.351    1.475
# 2 tab2(x2)   0.916    1.000

benchmark(tab(x2,useNA=TRUE), tab2(x2,useNA=TRUE), replications=100)[,c('test','elapsed','relative')]
#                     test elapsed relative
# 1  tab(x2, useNA = TRUE)   1.883    2.282
# 2 tab2(x2, useNA = TRUE)   0.825    1.000

benchmark(tab(x3), tab2(x3), replications=1000)[,c('test','elapsed','relative')]
#       test elapsed relative
# 1  tab(x3)   0.997    1.000
# 2 tab2(x3)   2.194    2.201

benchmark(tab(x4), tab2(x4), table(x4), replications=100)[,c('test','elapsed','relative')]
#        test elapsed relative
# 1   tab(x4)  19.481   18.714
# 2  tab2(x4)   1.041    1.000
# 3 table(x4)   6.515    6.258

tab2 is faster in every case except the very short vector `x3`. The performance gain becomes evident with the larger vector (see `x4` with 51,002 obs). It is also faster than `table`, even though the function does much more.

Now to my question: how can I further improve performance? Creating tables with frequencies and percentages is a pretty standard task, and a fast implementation is very useful when you work with large datasets.
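
(For reference, one possible direction that is not benchmarked here: a minimal base-R sketch that counts with tabulate() on factor codes instead of table(), which skips most of table()'s name handling. The function name tabf is hypothetical.)

tabf = function(x, useNA = FALSE) {
  # keep NA as an explicit factor level only when requested
  f    = factor(x, exclude = if (useNA) NULL else NA)
  freq = tabulate(as.integer(f), nbins = nlevels(f))   # NA codes are ignored by tabulate()
  n    = sum(freq)
  rn   = levels(f)
  rn[is.na(rn)] = '<NA>'
  out  = data.frame(Freq = freq, Percent = freq / n, cum = cumsum(freq / n))
  out  = rbind(out, c(n, 1, NA))                       # totals row, as in tab() and tab2()
  rownames(out) = c(rn, '')
  out
}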

EDIT: Here is an additional test case with a vector of 2e6 elements (including the data.table solution proposed in the answer below):

x5 = sample(c(1:100),2e6, replace=TRUE)
benchmark(tab(x5), tab2(x5), table(x5), tabdt(x5), replications=100)[,c('test','elapsed','relative')]
#        test elapsed relative
# 1   tab(x5) 350.878   19.444
# 2  tab2(x5)  52.917    2.932
# 4 tabdt(x5)  18.046    1.000
# 3 table(x5)  98.429    5.454
  • these are all tiny vectors and take no time to run with base - is this really what you mean by large datasets (or are you running this operation in a loop)? – eddi Jan 31 '14 at 16:43
  • no, my actual data is between 1 and 5 million rows. These are just test cases, and the performance difference already becomes obvious with `x4`, which has about 51,000 obs. – user2503795 Jan 31 '14 at 17:52
  • ok, I suggest doing benchmarks on real-sized data, as various options can scale very differently from 50k to 5M – eddi Jan 31 '14 at 17:58
  • working on it right now and will update with a new case – user2503795 Jan 31 '14 at 18:00

1 Answer


As I'm a big fan of `library(data.table)`, I wrote a similar function:

library(data.table)

tabdt <- function(x){
    n <- length(which(!is.na(x)))              # percentages are relative to the non-NA count
    dt <- data.table(x)
    out <- dt[, list(Freq = .N, Percent = .N / n), by = x]   # .N is the per-group count
    out[!is.na(x), CumSum := cumsum(Percent)]  # CumSum stays NA for the NA group, if any
    out
}
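
To illustrate the NA handling discussed in the comments below: NA values form their own group, and CumSum is only filled in for the non-NA rows. A usage sketch with the x2 vector from the question:

x2 <- c(rep('ARSON', 462), rep('BURGLARY', 22767), rep(NA, 100))
tabdt(x2)   # rows for ARSON, BURGLARY and NA; CumSum stays NA on the NA row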

> benchmark(tabdt(x1), tab2(x1), replications=1000)[,c('test','elapsed','relative')]
       test elapsed relative
2  tab2(x1)    5.60    1.879
1 tabdt(x1)    2.98    1.000
> benchmark(tabdt(x2), tab2(x2), replications=1000)[,c('test','elapsed','relative')]
       test elapsed relative
2  tab2(x2)    6.34    1.686
1 tabdt(x2)    3.76    1.000
> benchmark(tabdt(x3), tab2(x3), replications=1000)[,c('test','elapsed','relative')]
       test elapsed relative
2  tab2(x3)    1.65    1.000
1 tabdt(x3)    2.34    1.418
> benchmark(tabdt(x4), tab2(x4), replications=1000)[,c('test','elapsed','relative')]
       test elapsed relative
2  tab2(x4)   14.35    1.000
1 tabdt(x4)   22.04    1.536

So the data.table approach was faster for `x1` and `x2`, while dplyr was faster for `x3` and `x4`. I don't really see any room for further improvement with these approaches.

P.S. Would you add the data.table tag to this question? I believe people would love to see a dplyr vs. data.table performance comparison (see "data.table vs dplyr: can one do something well the other can't or does poorly?" for example).

  • Do you mind updating your answer with the actual benchmarks? Unfortunately I'm having a hell of a time installing `dplyr`, so I can't run them side by side (and confirm they are actually producing the same output). – BrodieG Jan 31 '14 at 13:41
  • @BrodieG what do you mean, you're having a hell of a time installing dplyr? What happens when you do `install.packages("dplyr")`? – Romain Francois Jan 31 '14 at 14:00
  • @RomainFrancois, for some reason (I could swear I read that someplace) I thought this was a GitHub-only release so far and was running into issues with suggested dependencies. The normal install works fine (need shoots-self-in-head emoticon). – BrodieG Jan 31 '14 at 14:25
  • Nice! I have added the keyword. I would argue that `tab2` does better, though, because it's faster when it counts (for longer vectors). Even `x4` is not particularly long; the others are just super short and will run very quickly anyway. – user2503795 Jan 31 '14 at 17:55
  • I have to take that back. I just added a new benchmark with 2e6 obs and your approach is faster. But your solution doesn't offer the `useNA` option and does not add a totals row... – user2503795 Jan 31 '14 at 18:16
  • Actually it does handle `NA` values. You can see it by examining the output of the `tabdt` function. – danas.zuokas Feb 01 '14 at 12:31
  • Your solution can be slightly improved in terms of performance: `tabdt2 <- function(x){ NnotNA <- sum(!is.na(x)); setnames(setDT(list(x)),"x")[,list(Freq = .N, Percent = .N / NnotNA), by = x][!is.na(x), CumSum := cumsum(Percent)] }` – jangorecki Jan 05 '15 at 18:08