
I have heard people everywhere saying to use data.table instead of data.frame, or that you can use a data.table wherever you use a data frame, but I still see a lot of differences like these:

> library(data.table)
> myDF <- data.frame(x = rnorm(3), y = rnorm(3))
> myDT <- data.table(myDF)
> myDT[,1]                                                                                                                                                              
[1] 1
> myDF[,1]                                                                                                                                                              
[1] 0.6621419 0.8494085 0.6490634
> myDF[,c("x","y")]
          x          y
1 0.6621419 -1.8987699
2 0.8494085 -0.6273099
3 0.6490634  0.4566892
> myDT[,c("x","y")]
[1] "x" "y"
> myDT[,x,y]
            y         x
1: -1.8987699 0.6621419
2: -0.6273099 0.8494085
3:  0.4566892 0.6490634
> myDF[,x,y]
Error in `[.data.frame`(myDF, , x, y) : object 'y' not found

How exactly are they different, and which one should I use?
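
For reference, a minimal sketch of the idiomatic data.table equivalents of the selections above, assuming the same `myDT`: in data.table, the `j` argument is an expression evaluated inside the table, and the third argument is `by`, which is why `myDT[,x,y]` above returns `x` grouped by `y`.

> library(data.table)
> myDT[, list(x, y)]                 # select columns by evaluating j as a list
           x          y
1: 0.6621419 -1.8987699
2: 0.8494085 -0.6273099
3: 0.6490634  0.4566892
> myDT[, c("x","y"), with = FALSE]   # data.frame-style selection via with = FALSE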

FUD
  • I second that - Read the [FAQ](http://datatable.r-forge.r-project.org/datatable-faq.pdf) – Andrie Feb 12 '13 at 13:26
  • 4
    You cannot use 'data.table' wherever you use a data frame, but you can always build something with a 'data.table' *instead* of a data frame. I personally don't like the overhead of programming against 'data.table', unless it perfectly suits the needs of a specific project. The 'data.table' methods aren't faster, unless you're doing large amounts of data aggregation and sorting without predisposed knowledge. I usually know something about the structure of my data that allows simple filters to work much faster than a 'data.table' method. – Dinre Feb 12 '13 at 13:29
  • 7
    In general, when you are not annoyed by how slow a certain analysis is, there is no need to go beyond the standard R `data.frame`. So, if you are a beginner I would stick to the base R soltuion first. There are certain applications where `data.table` really shines, for example calculating the mean value per unique id for a large dataset (say > 1e6 rows). In this case `data.table` is much faster than a standard R solution, let alone a `plyr` based solution. I have been using R for years now, and I have never needed `data.table`, although I have been tempted many times. – Paul Hiemstra Feb 12 '13 at 13:45
  • 1
    I agree with Paul. Stick with a data frame unless you have issues. There are "certain applications," as Paul puts it, where a 'data.table' method will be clearly better, but unless you have one of those "certain applications, 'data.table' is sometimes inferior. In my personal case, I have yet to experience a data set where I couldn't code a faster solution than the corresponding 'data.table' method. – Dinre Feb 12 '13 at 15:05
  • 6
    What Paul and Dinre said, only adding that *if/when you do need it*, **data.table** will feel like a miraculous and undeserved gift. – Josh O'Brien Feb 12 '13 at 17:09
  • data.table was very handy when I was trying to do a complicated conditional mutate with dplyr. I used data.table instead and ended up with faster, easier-to-read code. See the [thread here](http://stackoverflow.com/questions/24459752/can-dplyr-package-be-used-for-conditional-mutating) and look at Arun's answer. – variable Feb 27 '15 at 21:08
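
To make Paul's group-mean example concrete, here is a minimal sketch with made-up data (the `id`/`value` names and the 1e7 row count are illustrative, not from the thread):

> library(data.table)
> n  <- 1e7
> DF <- data.frame(id = sample(1e4, n, replace = TRUE), value = rnorm(n))
> DT <- as.data.table(DF)
> system.time(tapply(DF$value, DF$id, mean))   # base R: mean value per unique id
> system.time(DT[, mean(value), by = id])      # data.table equivalent, typically much faster

Exact timings depend on the machine and the data.table version, but this per-group aggregation on large data is the kind of "certain application" where data.table tends to win clearly.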

0 Answers