
I have heard people everywhere saying to use data.table instead of data.frame, or that you can use a data.table wherever you use a data frame, but I still see a lot of differences like these:

> library(data.table)
> myDF <- data.frame(x = rnorm(3), y = rnorm(3))
> myDT <- data.table(myDF)
> myDT[,1]                                                                                                                                                              
[1] 1
> myDF[,1]                                                                                                                                                              
[1] 0.6621419 0.8494085 0.6490634
> myDF[,c("x","y")]
          x          y
1 0.6621419 -1.8987699
2 0.8494085 -0.6273099
3 0.6490634  0.4566892
> myDT[,c("x","y")]
[1] "x" "y"
> myDT[,x,y]
            y         x
1: -1.8987699 0.6621419
2: -0.6273099 0.8494085
3:  0.4566892 0.6490634
> myDF[,x,y]
Error in `[.data.frame`(myDF, , x, y) : object 'y' not found

How exactly are they different, and which one should I use?
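
For reference, a minimal sketch of the idiomatic data.table equivalents of the selections above, assuming the same `myDT`: in data.table, the `j` argument is an expression evaluated inside the table, and the third argument is `by`, which is why `myDT[,x,y]` above returns `x` grouped by `y`.

> library(data.table)
> myDT[, list(x, y)]                 # select columns by evaluating j as a list
           x          y
1: 0.6621419 -1.8987699
2: 0.8494085 -0.6273099
3: 0.6490634  0.4566892
> myDT[, c("x","y"), with = FALSE]   # data.frame-style selection via with = FALSE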

FUD
  • I second that - Read the [FAQ](http://datatable.r-forge.r-project.org/datatable-faq.pdf) – Andrie Feb 12 '13 at 13:26
  • 4
    You cannot use 'data.table' wherever you use a data frame, but you can always build something with a 'data.table' *instead* of a data frame. I personally don't like the overhead of programming against 'data.table', unless it perfectly suits the needs of a specific project. The 'data.table' methods aren't faster, unless you're doing large amounts of data aggregation and sorting without predisposed knowledge. I usually know something about the structure of my data that allows simple filters to work much faster than a 'data.table' method. – Dinre Feb 12 '13 at 13:29
  • 7
    In general, when you are not annoyed by how slow a certain analysis is, there is no need to go beyond the standard R `data.frame`. So, if you are a beginner I would stick to the base R soltuion first. There are certain applications where `data.table` really shines, for example calculating the mean value per unique id for a large dataset (say > 1e6 rows). In this case `data.table` is much faster than a standard R solution, let alone a `plyr` based solution. I have been using R for years now, and I have never needed `data.table`, although I have been tempted many times. – Paul Hiemstra Feb 12 '13 at 13:45
  • 1
    I agree with Paul. Stick with a data frame unless you have issues. There are "certain applications," as Paul puts it, where a 'data.table' method will be clearly better, but unless you have one of those "certain applications, 'data.table' is sometimes inferior. In my personal case, I have yet to experience a data set where I couldn't code a faster solution than the corresponding 'data.table' method. – Dinre Feb 12 '13 at 15:05
  • 6
    What Paul and Dinre said, only adding that *if/when you do need it*, **data.table** will feel like a miraculous and undeserved gift. – Josh O'Brien Feb 12 '13 at 17:09
  • data.table was very handy when I was trying to do a complicated conditional mutate with dplyr. I used data.table instead and ended up with faster, easier-to-read code. See the [thread here](http://stackoverflow.com/questions/24459752/can-dplyr-package-be-used-for-conditional-mutating) and look at Arun's answer. – variable Feb 27 '15 at 21:08
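
To make Paul's group-mean example concrete, here is a minimal sketch with made-up data (the `id`/`value` names and the 1e7 row count are illustrative, not from the thread):

> library(data.table)
> n  <- 1e7
> DF <- data.frame(id = sample(1e4, n, replace = TRUE), value = rnorm(n))
> DT <- as.data.table(DF)
> system.time(tapply(DF$value, DF$id, mean))   # base R: mean value per unique id
> system.time(DT[, mean(value), by = id])      # data.table equivalent, typically much faster

Exact timings depend on the machine and the data.table version, but this per-group aggregation on large data is the kind of "certain application" where data.table tends to win clearly.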

0 Answers