
I just spent some time researching data.table in R and was wondering under which conditions I can expect the largest performance gains. Maybe the simple answer is: when I have a large data.frame and often operate on subsets of it. When I just load data files and estimate models I can't expect much, but many `[` operations make the difference. Is that the only answer, or what else should I consider? When does it start to matter: 10x5, 1,000x5, 1,000,000x5?
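For concreteness, here is the kind of repeated subsetting I mean by "many `[` operations" (a minimal, made-up sketch, not a benchmark; the column names and sizes are invented):

```r
library(data.table)

## Made-up 1,000,000 x 5 data, repeatedly subset on one column
n  <- 1e6
DF <- data.frame(id = sample(LETTERS, n, replace = TRUE),
                 x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
DT <- as.data.table(DF)
setkey(DT, id)   # sort once, so later subsets can use binary search

system.time(for (i in 1:100) DF[DF$id == "A", ])  # vector scan every time
system.time(for (i in 1:100) DT["A"])             # keyed (binary search) subset
```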

Edit: Some of the comments suggest that data.table is often faster and, equally important, almost never slower. So it would also be good to know when not to use data.table.

user2503795
    A great (if broad) question. Ideally this would be answered by pointing to the timings vignette ([latest version here](http://datatable.r-forge.r-project.org/datatable-timings.pdf)), but at this point it's quite undeveloped. I'm sure, though, that Matthew Dowle would appreciate help with it or some similar document! – Josh O'Brien Dec 06 '12 at 19:27
    It might be worth mentioning that using a `data.table` is probably never slower than using a `data.frame` (if you find a case, I bet it will get patched quickly). In addition to speed of calculation, a `data.table` solution will probably take fewer keystrokes. – GSee Dec 06 '12 at 19:32
    Because what GSee says is true in my experience, and in general data.table inherits from data.frame, I think the question could be better posed as when *not* to use data.table. The only thing I have wanted to do with it that I could not is use rbind.fill() – frankc Dec 06 '12 at 20:32
  • Great comments, great answers so far! frankc, I will add that point to my question. – user2503795 Dec 06 '12 at 20:40
  • @frankc +1 But what's the issue with `rbind.fill`? This works fine for me: `rbind.fill(as.data.table(mtcars[c("mpg", "wt")]), as.data.table(mtcars[c("wt", "cyl")]))`. – Matt Dowle Dec 07 '12 at 11:43

2 Answers


There are at least a few cases where data.table shines:

  • Updating an existing dataset with new results. Because data.table updates by reference (e.g. with `:=`), it avoids copying the whole object and is massively faster (see the sketch after this list).
  • Split-apply-combine type strategies with large numbers of groups to split over (as @PaulHiemstra's answer points out).
  • Doing almost anything to a truly large dataset.
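To make the first point concrete, here is a minimal sketch of an update by reference with `:=` (the data and column names are made up; the data.frame line is only there for contrast):

```r
library(data.table)

n  <- 1e6
DT <- data.table(id = 1:n, value = rnorm(n))
DF <- as.data.frame(DT)

## data.frame: adding a column generally copies the whole object
system.time(DF$value2 <- DF$value * 2)

## data.table: := adds the column in place, by reference, without copying
system.time(DT[, value2 := value * 2])
```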

Here are some benchmarks: Benchmarking data.frame (base), data.frame (package dataframe) and data.table

Ari B. Friedman

One instance where data.table is veeeery fast is in the split-apply-combine type of work which made plyr famous. Say you have a data.frame with the following data:

precipitation     time   station_id
23.3              1      A01
24.1              2      A01
26.1              1      A02
etc etc

When you need to average per station id, you can use a host of R functions, e.g. `ave`, `ddply`, or data.table. If the number of unique elements in station_id grows, data.table scales really well, whilst e.g. `ddply` gets really slow. More details, including an example, can be found in this post on my blog. That test suggests that speed increases of more than 150-fold are possible, and the difference can probably be even bigger...
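For reference, a minimal sketch of that kind of comparison (the column names follow the example table above; the number of rows, the station ids, and the exact calls are illustrative assumptions, not the benchmark from the blog post):

```r
library(data.table)
library(plyr)

## Simulated data in the layout shown above, with many distinct stations
n  <- 1e6
df <- data.frame(precipitation = runif(n),
                 time          = sample(1:100, n, replace = TRUE),
                 station_id    = sample(sprintf("A%04d", 1:5000), n, replace = TRUE))
dt <- as.data.table(df)

## base R: one mean per station (ave() gives the per-row equivalent)
system.time(tapply(df$precipitation, df$station_id, mean))

## plyr: slows down as the number of unique station_id values grows
system.time(ddply(df, "station_id", summarise, mean_prec = mean(precipitation)))

## data.table: grouped mean, scales well with the number of groups
system.time(dt[, mean(precipitation), by = station_id])
```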

Paul Hiemstra
  • The next iteration of `plyr`, `dplyr`, will be fighting back (performance-wise). It should be 10-100x faster, and within a factor of 10 of the speed of `data.table` (all using pure R, too). It will also let you use `data.table` as a backend, so you can have the best of both worlds. – hadley Dec 06 '12 at 20:48
  • That sounds awesome! When will it be released? – Paul Hiemstra Dec 06 '12 at 22:18
  • Yes, sounds awesome! Where are the gains coming from? And how does the integration with `data.table` work? An option for the backend or based on the passed arguments? – user2503795 Dec 06 '12 at 23:01
  • @PaulHiemstra in the next 6 months, all going well. The gains come from specialising the most common parts of plyr (e.g. ddply + subset/summarise/mutate/arrange) and minimising the number of intermediate subsets of the data frame that are made. It will also support SQL databases as a backend. – hadley Dec 07 '12 at 14:15