
I'd like to use the `diff` function on a really big data.frame: 140 million rows and two columns.

The goal is to compute the gap between consecutive activity dates for each user_id. For each user, the first activity has no previous one, so I need an NA value there.
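For example, on a toy input (illustrative values, not my real data), this is the result I want:

df <- data.frame(user_id = c(1, 1, 1, 2, 2),
                 DATE = as.Date(c("2013-01-01", "2013-01-03", "2013-01-10",
                                  "2013-02-01", "2013-02-05")))
## Desired gaps in days: NA, 2, 7 for user 1 and NA, 4 for user 2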

I used the function below, and it works for a small dataset, but with the big one it's really slow. It has been running since yesterday and still hasn't finished.

df2 <- as.vector(unlist(tapply(df$DATE, df$user_id, FUN = function(x) c(NA, diff(x)))))

I have a lot of memory (24 GB) and a 4-core CPU, but only one core is working.

How can I deal with this problem? Would it be better to convert the data.frame to a matrix?

– Ricol
  • I'd look into using `data.table` for problems involving datasets of this magnitude. You will see a substantial speed improvement converting your df to a data.table (`DT <- data.table(df)`). – Simon O'Hanlon May 03 '13 at 13:23
  • Multiply the time it takes on your small dataset by the ratio of big to small dataset size. Has that much time elapsed? To answer your other question: in general, matrix objects can be processed faster than data.frame objects. – Carl Witthoft May 03 '13 at 13:25
  • @CarlWitthoft matrices will be *much* faster, but R will still make background copies of the data right? And that will be *one* of the things slowing it down on a dataset so big. – Simon O'Hanlon May 03 '13 at 13:29
  • Can you confirm you can do it (on a simple vector of the same size) in only 4 seconds? Because that's not the case for me. – Ricol May 03 '13 at 13:45
  • @SimonO101 true enough. My calculation provides a lower limit on processing time, at least. – Carl Witthoft May 03 '13 at 14:28

2 Answers


Here is an example using simulated data: first a dataset of 10 million rows (100 users with 100,000 time points each), then one of 140 million rows (1,400 users with the same number of time points each). It transposes the time points to the columns; I should imagine it would be even faster if you transposed users to the columns instead. I used @Arun's answer here as a template. Basically it shows that on a really big table you can do this on a single core (i7, 2.6 GHz) in under 90 seconds, and that is using code which is probably not fully optimised:

require(data.table)

## Smaller sample dataset - 10 million row, 100 users, 100,000 time points each
DT <- data.table( Date = sample(100, 1e7, replace = TRUE) , User = rep(1:100, each = 1e5) )

## Size of table in memory
tables()
#    NAME       NROW MB COLS      KEY
#[1,] DT   10,000,000 77 Date,User    
#Total: 77MB


## Diff by user
dt.test <- quote({
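    ## NB: seeds the first diff with 0, so each user's first gap is
    ## Date[1] - 0 rather than the NA asked for in the question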
    DT2 <- DT[ , list(Diff=diff(c(0,Date))) , by=list(User) ]
    DT2 <- DT2[, as.list(setattr(Diff, 'names', 1:length(Diff))) , by = list(User)]
})


## Benchmark it
require(microbenchmark)
microbenchmark( eval(dt.test) , times = 5L )
#Unit: seconds
#         expr      min       lq median       uq      max neval
# eval(dt.test) 5.788364 5.825788 5.9295 5.942959 6.109157     5

## And with 140 million rows...
DT <- data.table( Date = sample(100, 1.4e8, replace = TRUE) , User = rep(1:1400, each = 1e5) )
#tables()
#    NAME        NROW   MB
#[1,] DT   140,000,000 1069

microbenchmark( eval(dt.test) , times = 1L )
#Unit: seconds
#         expr     min      lq  median      uq     max neval
# eval(dt.test) 84.3689 84.3689 84.3689 84.3689 84.3689     1
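If all you need is the per-user gap column in long format, with an NA for each user's first row as in the question, a sketch of the direct data.table way (not benchmarked here) would be:

DT[ , Diff := c(NA, diff(Date)) , by = User ]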
– Simon O'Hanlon

This is a lot faster if you avoid tapply altogether, which is fairly easy because your tapply call assumes the data are already sorted by user_id and DATE.

set.seed(21)
N <- 1e6
Data <- data.frame(DATE=Sys.Date()-sample(365,N,TRUE),
                   USER=sample(1e3,N,TRUE))
Data <- Data[order(Data$USER,Data$DATE),]
system.time({
  Data$DIFF <- unlist(tapply(Data$DATE,Data$USER, function(x) c(NA,diff(x))))
})
#   user  system elapsed 
#   1.58    0.00    1.59
Data2 <- Data
system.time({
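  ## One global diff over the whole column, then blank out the first
  ## row of each user (where the USER value changes)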
  Data2$DIFF <- c(NA,diff(Data2$DATE))
  is.na(Data2$DIFF) <- which(c(NA,diff(Data2$USER))==1)
})
#   user  system elapsed 
#   0.12    0.00    0.12
identical(Data,Data2)
# [1] TRUE
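Translated to the original column names, the same approach might look like this (a sketch: it assumes df is sorted by user_id and then DATE, and that user_id is numeric; the boundary test uses != 0 so it doesn't rely on consecutive ids):

df <- df[order(df$user_id, df$DATE), ]
df$DIFF <- c(NA, diff(df$DATE))
df$DIFF[c(TRUE, diff(df$user_id) != 0)] <- NA  # NA at each user's first row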
– Joshua Ulrich