
I have this dataset of financial transactions; it's pretty big, but small enough to keep in memory.

R> str(trans)
'data.frame':   130000000 obs. of  5 variables:
 $ id    : int  5 5 5 5 6 11 11 11 11 11 ...
 $ kod   : int  2 3 2 3 38 2 3 6 7 6 ...
 $ ar    : int  329 329 330 330 7 329 329 329 329 329 ...
 $ belopp: num  1531 -229.3 324 -48.9 0 ...
 $ datum : int  36976 36976 37287 37287 37961 36976 36976 37236 37236 37281 ...

I need to loop through it, extracting the transactions for each unique id, and do a bunch of calculations. The trouble is that subsetting the dataset is way too slow:

R> system.time(
+ sub <- trans[trans$id==15,]
+ )
   user  system elapsed 
   7.80    0.55    8.36


R> system.time(
+ sub <- subset(trans, id == 15)
+ )
   user  system elapsed 
   8.49    1.05    9.53 

As there are about 10m unique ids in this dataset, such a loop would take forever. Any ideas how I might speed it up?
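
Roughly, the loop I have in mind looks like this, with `calc()` just standing in for the actual calculations:

# rough sketch of the loop -- calc() is a placeholder for the real calculations
ids <- unique(trans$id)                 # ~10m unique ids
out <- vector("list", length(ids))
for (i in seq_along(ids)) {
  sub <- trans[trans$id == ids[i], ]    # this subset is the slow part
  out[[i]] <- calc(sub)
}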

EDIT I've dabbled with `data.table`, indexing and sorting, with not much luck at all:

library(data.table)
trans2 <- as.data.table(trans)
trans2 <- trans2[order(id)]
trans2 <- setkey(trans2, id)

R> system.time(
+ sub <- trans2[trans2$id==15,]
+ )
   user  system elapsed 
   7.33    1.08    8.41 

R> system.time(
+ sub <- subset(trans2, id == 15)
+ )
   user  system elapsed 
   8.66    1.12    9.78

EDIT2 Awesome.

R> system.time(
+ sub <- trans2[J(15)]
+ )
   user  system elapsed 
      0       0       0 
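
Presumably that also means the explicit loop can go away entirely and the per-id calculations can run in one grouped call, something like this (`calc()` again just a stand-in for my calculations):

# sketch: let data.table do the grouping instead of looping over ids
# calc() is a placeholder that takes one id's transactions (.SD) and
# returns its results as a list / data.frame
res <- trans2[, calc(.SD), by = id]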
jenswirf
  • Could you describe what you've tried with `data.table`s and where you've run into problems? – BenBarnes Jan 03 '13 at 13:12
  • @BenBarnes Sure, I tried converting the **data.frame** to a `data.table`, sorting it and using the `setkey()` function to index the _id_ column. This did not lower the subsetting time at all.. – jenswirf Jan 03 '13 at 13:22
  • Are you still using a loop? `data.table`'s `by` argument to `[` will be much faster, I think: http://stackoverflow.com/questions/11279304/split-apply-recombine-plyr-data-table-in-r – Ari B. Friedman Jan 03 '13 at 13:28
  • If you're converting to a `data.table` and using `setkey`, you shouldn't need to sort in an extra step. Also, for subsetting, try (with your keyed data.table named, say, `transDT`) `transDT[J(15)]`. Please also consider editing your question to include the data.table code you used. – BenBarnes Jan 03 '13 at 13:41

2 Answers


Note: the post has been edited to change the function being benchmarked from rowSums to colSums (using lapply in the case of data.table).

I don't think you could get the result faster than with data.table. Here's a benchmark between plyr and data.table. Of course, if the time-consuming part is your function, then you could use doMC to run plyr in parallel (assuming you have a lot of cores or you work on a cluster). Otherwise, I'd stick to data.table. Here's an analysis with a large test dataset and a dummy function:

# create a huge data.frame with repeating id values
len <- 1e5
reps <- sample(1:20, len, replace = TRUE)
x <- data.frame(id = rep(1:len, reps))
x <- transform(x, v1 = rnorm(nrow(x)), v2 = rnorm(nrow(x)))

> nrow(x) 
[1] 1048534 # 1 million rows

# construct functions for data.table and plyr
# method 1
# using data.table
DATA.TABLE <- function() {
    require(data.table)
    x.dt <- data.table(x, key="id")
    x.dt.out <- x.dt[, lapply(.SD, sum), by=id]
}

# method 2
# using plyr
PLYR <- function() {
    require(plyr)
    x.plyr.out <- ddply(x, .(id), colSums)
}

# let's benchmark
> require(rbenchmark)
> benchmark(DATA.TABLE(), PLYR(), order = "elapsed", replications = 1)[1:5]
          test replications elapsed relative user.self
1 DATA.TABLE()           1   1.006     1.00     0.992
2       PLYR()           1  67.755   67.351  67.688

On a data.frame with 1 million rows, data.table takes 0.992 seconds. The speed-up of data.table over plyr (admittedly, on computing column sums) is about 68x. Depending on the computation time inside your function, this speed-up will vary, but data.table will still be way faster. plyr is a split-apply-combine strategy; I don't think you'd get a comparable speed-up by using base R to split, apply and combine yourself, but of course you can try it.

I ran the code with 10 million rows. data.table ran in 5.893 seconds. plyr took 6300 seconds.
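
If your real per-id function turns out to be the bottleneck and you go the plyr route, here's a minimal sketch of the doMC option mentioned above (assuming a multi-core machine; the gain depends entirely on how expensive the per-group function is):

require(plyr)
require(doMC)
registerDoMC(cores = 4)   # adjust to the number of cores available
# same split-apply as PLYR() above, but with the groups processed in parallel
x.plyr.par <- ddply(x, .(id), colSums, .parallel = TRUE)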

Arun
  • +1 but why test `sum(rowSums(w))`? It's important to do things by column in data.table, _never_ by row as that'll be page inefficient. Try testing `lapply(.SD,sum)` and `sum(lapply(.SD,sum))` and they should be faster. The 22 seconds is the total for 10 runs, too, it seems. Anyone glancing at this answer who doesn't know R or data.table or plyr might be aghast that this takes 22 seconds on just 1e6 rows. It doesn't, it takes 2.2s, and the `lapply` should be better still. – Matt Dowle Jan 03 '13 at 14:44
  • I was focussing on the `Update:` part there. Now I see previous paragraph but I'm not 100% clear. – Matt Dowle Jan 03 '13 at 14:51
  • @Matthew, All I want to show is that the only limiting factor is the time taken for the function to compute. I really don't mind if the function takes more time. The purpose was to show that for the same function, `data.table` would be the fastest. Not sure if this makes things clear..? – Arun Jan 03 '13 at 14:56
  • Ok I kind of see, that helps, but could you use `lm` or something then for that? A benchmark using `rowSums` is sending me (at least) off on a tangent. – Matt Dowle Jan 03 '13 at 15:44
  • Modified using `lapply(.SD, sum)`. One question: when I run the `data.table` part as `x.dt.out <- x.dt[, colSums(.SD), by=id]`, why do I get a `20,000 * 2` data.table instead of a `10,000 * 3` data.table? The results are pooled together into 1 column. – Arun Jan 03 '13 at 19:45
  • Yep, because the result of `colSums` is a vector. To get it as columns it's `as.list(colSums(.SD))` but of course the `as.list` will copy all those small vectors into a new (small) `list` for each group, so the direct `lapply(.SD,sum)` should be faster by avoiding those coercions. – Matt Dowle Jan 03 '13 at 22:58
  • Great answer btw, that's more like it :) – Matt Dowle Jan 03 '13 at 23:01
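
A small sketch to make the shapes discussed in these comments concrete, using the `x.dt` table from the answer (two value columns, v1 and v2):

# colSums(.SD) returns a plain named vector, so the two sums get stacked
# into a single V1 column, giving two rows per id:
x.dt[, colSums(.SD), by = id]            # columns: id, V1
# as.list() spreads the sums back out into columns, one row per id,
# but copies a small list for every group:
x.dt[, as.list(colSums(.SD)), by = id]   # columns: id, v1, v2
# lapply(.SD, sum) gives the same shape while avoiding that coercion:
x.dt[, lapply(.SD, sum), by = id]        # columns: id, v1, v2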

Why not use the split, apply and combine strategy?

Something like this (without sample data, I don't know if this will work):

# split stage: split the row indices of the data frame into a list by id
fastsplit <- function(df) {
  lista <- split(seq_len(nrow(df)), df$id)
  return(lista)
}

lista_split <- fastsplit(trans)

# apply stage: assuming that one of the calculations is, for instance,
# to sum belopp for each id
result1 <- lapply(lista_split, function(.indx) {
  sum(trans$belopp[.indx])
})

# combine stage
r1 <- do.call(rbind, result1)
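
If you need the ids themselves next to the sums, something like this should work (again untested without your data; the ids come back as the names of the list):

# optional: attach the ids back to the combined result
r1_df <- data.frame(id = as.integer(names(result1)),
                    sum_belopp = unlist(result1))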

Having provided the code above, I'd say it would be faster and easier if you could use SQL. Maybe the package sqldf could help you here? I never tried it, though, so I don't know how fast it is. The SQL is quite simple; to do the same as the R code above, use something like:

select id
     , sum(belopp) as sum_belopp
from trans
group by id

This will return a table with two columns: id and the sum of belopp by id.
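
If you do try sqldf, a minimal sketch of how that query might be run from R (untested on my side; sqldf operates directly on data frames in your workspace):

# run the aggregation with sqldf (SQLite by default) on the trans data.frame
library(sqldf)
r1 <- sqldf("select id, sum(belopp) as sum_belopp from trans group by id")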

Manoel Galdino
  • Having seen @Arun's answer, I must say I never used data.table, so I don't know if my solution is faster than data.table. However, I do know that my solution is quite a bit faster than plyr. Still, I'd suggest you move to SQL. – Manoel Galdino Jan 03 '13 at 14:09
  • I will for sure test whether the split strategy is superior to using a huge loop, which was my first thought. However, you wrongly assumed I only need a simple calculation on each subset; there are in fact a few hundred lines of code there, so SQL will just not work. – jenswirf Jan 03 '13 at 14:20
  • Please let me know if you find out if the split strategy is better than data.table... – Manoel Galdino Jan 03 '13 at 17:34
  • Hi Manoel. This may give some insight about data.table vs SQL speed : http://stackoverflow.com/a/6594199/403310. – Matt Dowle Jan 03 '13 at 23:29
  • Thank you @Matthew Dowle. I must confess that I'm quite surprised to learn that data.table is faster than SQL. I'll for sure take a look at data.table. – Manoel Galdino Jan 04 '13 at 16:42
  • @ManoelGaldino To be clear, not for _everything_. Just at what data.table can do and claims to be fast at. It's in-memory only (but 64bit and 128GB RAM is quite a bit) whereas obviously SQL can handle larger on disk databases which data.table can't do. And SQL is better/faster at inserting and deleting records at arbitrary locations in a table, which data.table can't do at all. But in R we tend to read in data and then work with it a lot (write once read many). The fact that R, and data.table, provide guaranteed row order is a big win over row-stored SQL; shorter and easier syntax. – Matt Dowle Jan 04 '13 at 18:11
  • And of course R has more functions (from 4,000 packages) than SQL does, all usable from data.table. Or you can embed R into the SQL database, too; e.g., you can embed R into Postgres, I believe. – Matt Dowle Jan 04 '13 at 18:22
  • I use R with SQL (making queries in R with a RODBC connection). – Manoel Galdino Jan 07 '13 at 12:12