I have read the FAQ, but it is still not clear to me what the implications are of setting a key versus passing that same column in the concatenated by= list, for a reasonably large data.table.

From my experiment, I see only a performance difference, but I am not sure whether there is anything else.

# install.packages(c("data.table", "stringi"), dependencies = TRUE)
library(data.table)
library(stringi)
download.file("https://www.ssa.gov/oact/babynames/state/namesbystate.zip", destfile="namesbystate.zip", mode="wb")
unzip("namesbystate.zip", exdir=".")
# List the per-state text files extracted from the archive
filelist = list.files(path=".", pattern="\\.TXT$")
colnamelist = c("State","gender","year","name","frequency")
# Read each CSV-formatted text file into a data.frame
babynames = lapply(filelist, FUN=read.csv, header=FALSE, col.names=colnamelist)
# rbindlist() already returns a data.table, so no further conversion is needed
DT = rbindlist(babynames, use.names=FALSE, fill=FALSE)
dim(DT) #[1] 5647426       5
setkey(DT,NULL)
system.time(head(DT[,( stri_length(name)),by=c("name", "year")]))
#    user  system elapsed 
#  156.47    0.03  157.64 

setkey(DT,year)
system.time(head(DT[,( stri_length(name)),by=name]))
#    user  system elapsed 
#    8.90    0.00    8.99 

The output is identical in both cases

      name year V1
1:    Anna 1910  4
2:   Annie 1910  5
3: Dorothy 1910  7
4:   Elsie 1910  5
5:   Helen 1910  5
6:    Lucy 1910  4
  • You can test the timing for two steps together like `system.time({setkey(DT,year); DT[,nchar(name),by=year]})`, fyi. – Frank Jan 22 '16 at 20:03
  • Please share the data via `dput()`, see [How to create a Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve). – Eric Fail Jan 22 '16 at 20:08
  • If you can't share the data itself you should strive to produce [a minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve)? – Eric Fail Jan 22 '16 at 20:39
  • I highly doubt those two expressions produce identical results. – eddi Jan 22 '16 at 20:40
  • see: http://stackoverflow.com/questions/20039335/what-is-the-purpose-of-setting-a-key-in-data-table, specifically "Is there an advantage to setting key on by= operations?". – Chris Jan 22 '16 at 21:44
  • Great! I updated your code a bit (added `https`) and ran it. I don't get identical output. The former gives me three variables and 548154 observations while the latter yields two variables and 30274 observations. It might look identical from the `head()`, but it's not identical. I suggest you make a dummy data set and play with the two different approaches to see the difference more clearly. – Eric Fail Jan 22 '16 at 21:52
  • provide session info – jangorecki Jan 22 '16 at 22:15
  • @AdityaKher I think it's much more likely that you're just confused and/or accidentally looked at some **other** output to conclude that the results are the same, than the results actually being the same. The commands you wrote above will (at the very least) produce different number of columns. – eddi Jan 23 '16 at 00:43
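
To make the difference described in the comments concrete, here is a minimal sketch on a made-up table. The `toy` data and its values are hypothetical (not the SSA files), and `nchar` stands in for `stri_length` only to avoid the stringi dependency. It compares grouping by both columns with keying on `year` and grouping by `name` alone.

```r
library(data.table)

# Hypothetical toy data standing in for the baby-names table
toy <- data.table(
  name = c("Anna", "Anna", "Anna", "Lucy", "Lucy"),
  year = c(1910L, 1910L, 1911L, 1910L, 1911L)
)

# Group by both columns: one row per distinct (name, year) pair
by_both <- toy[, nchar(name), by = c("name", "year")]

# Key on year (this sorts the table), then group by name only:
# one row per distinct name, and no year column in the result
setkey(toy, year)
by_name <- toy[, nchar(name), by = name]

dim(by_both)  # 4 rows, 3 columns: name, year, V1
dim(by_name)  # 2 rows, 2 columns: name, V1
```

In this sketch the key changes how the rows are sorted, but the groups are still defined only by the columns listed in `by`. That would be consistent with the second timing being faster mainly because it computes far fewer groups, although that explanation has not been verified on the full data set here.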
