
I'd like to use the `diff` function on a really big data.frame: 140 million rows and two columns.

The goal is to compute the gap between consecutive activity dates for each user_id. For each user, the first activity has no previous one, so I need an NA value there.
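For example, on a toy input (illustrative values, not my real data), this is the result I want:

df <- data.frame(user_id = c(1, 1, 1, 2, 2),
                 DATE = as.Date(c("2013-01-01", "2013-01-03", "2013-01-10",
                                  "2013-02-01", "2013-02-05")))
## Desired gaps in days: NA, 2, 7 for user 1 and NA, 4 for user 2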

I used the function below, and it works for a small dataset, but with the big one it's really slow. It has been running since yesterday and still hasn't finished.

df2 <- as.vector(unlist(tapply(df$DATE, df$user_id, FUN = function(x) c(NA, diff(x)))))

I have a lot of memory (24 GB) and a 4-core CPU, but only one core is working.

How can I deal with this problem? Would it be better to convert the data.frame to a matrix?

– Ricol
  • I'd look into using `data.table` for problems involving datasets of this magnitude. You will see a substantial speed improvement converting your df to a data.table (`DT <- data.table(df)`). – Simon O'Hanlon May 03 '13 at 13:23
  • Multiply the time it takes on your small dataset by the ratio of big to small dataset size. Has that much time elapsed? To answer your other question: in general, matrix objects can be processed faster than data.frame objects. – Carl Witthoft May 03 '13 at 13:25
  • @CarlWitthoft matrices will be *much* faster, but R will still make background copies of the data right? And that will be *one* of the things slowing it down on a dataset so big. – Simon O'Hanlon May 03 '13 at 13:29
  • Can you confirm you can do it (on a simple vector of the same size) in only 4 seconds? Because that's not the case for me. – Ricol May 03 '13 at 13:45
  • @SimonO101 true enough. My calculation provides a lower limit on processing time, at least. – Carl Witthoft May 03 '13 at 14:28

2 Answers


Here is an example using simulated data: first a dataset of 10 million rows (100 users with 100,000 time points each), then one of 140 million rows (1,400 users with the same number of time points each). It transposes the time points to the columns; I should imagine it would be even faster if you transposed users to the columns instead. I used @Arun's answer here as a template. Basically it shows that on a really big table you can do this on a single core (i7, 2.6 GHz) in under 90 seconds, and that is using code which is probably not fully optimised:

require(data.table)

## Smaller sample dataset - 10 million row, 100 users, 100,000 time points each
DT <- data.table( Date = sample(100, 1e7, replace = TRUE) , User = rep(1:100, each = 1e5) )

## Size of table in memory
tables()
#    NAME       NROW MB COLS      KEY
#[1,] DT   10,000,000 77 Date,User    
#Total: 77MB


## Diff by user
dt.test <- quote({
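    ## NB: seeds the first diff with 0, so each user's first gap is
    ## Date[1] - 0 rather than the NA asked for in the question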
    DT2 <- DT[ , list(Diff=diff(c(0,Date))) , by=list(User) ]
    DT2 <- DT2[, as.list(setattr(Diff, 'names', 1:length(Diff))) , by = list(User)]
})


## Benchmark it
require(microbenchmark)
microbenchmark( eval(dt.test) , times = 5L )
#Unit: seconds
#         expr      min       lq median       uq      max neval
# eval(dt.test) 5.788364 5.825788 5.9295 5.942959 6.109157     5

## And with 140 million rows...
DT <- data.table( Date = sample(100, 1.4e8, replace = TRUE) , User = rep(1:1400, each = 1e5) )
#tables()
#    NAME        NROW   MB
#[1,] DT   140,000,000 1069

microbenchmark( eval(dt.test) , times = 1L )
#Unit: seconds
#         expr     min      lq  median      uq     max neval
# eval(dt.test) 84.3689 84.3689 84.3689 84.3689 84.3689     1
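If all you need is the per-user gap column in long format, with an NA for each user's first row as in the question, a sketch of the direct data.table way (not benchmarked here) would be:

DT[ , Diff := c(NA, diff(Date)) , by = User ]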
– Simon O'Hanlon

This is a lot faster if you avoid tapply altogether, which is fairly easy because your tapply call assumes the data are already sorted by user_id and DATE.

set.seed(21)
N <- 1e6
Data <- data.frame(DATE=Sys.Date()-sample(365,N,TRUE),
                   USER=sample(1e3,N,TRUE))
Data <- Data[order(Data$USER,Data$DATE),]
system.time({
  Data$DIFF <- unlist(tapply(Data$DATE,Data$USER, function(x) c(NA,diff(x))))
})
#   user  system elapsed 
#   1.58    0.00    1.59
Data2 <- Data
system.time({
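  ## One global diff over the whole column, then blank out the first
  ## row of each user (where the USER value changes)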
  Data2$DIFF <- c(NA,diff(Data2$DATE))
  is.na(Data2$DIFF) <- which(c(NA,diff(Data2$USER))==1)
})
#   user  system elapsed 
#   0.12    0.00    0.12
identical(Data,Data2)
# [1] TRUE
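Translated to the original column names, the same approach might look like this (a sketch: it assumes df is sorted by user_id and then DATE, and that user_id is numeric; the boundary test uses != 0 so it doesn't rely on consecutive ids):

df <- df[order(df$user_id, df$DATE), ]
df$DIFF <- c(NA, diff(df$DATE))
df$DIFF[c(TRUE, diff(df$user_id) != 0)] <- NA  # NA at each user's first row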
– Joshua Ulrich