I have read the FAQ, but it is still not clear to me: what are the implications of setting a key on a column versus including that column in the concatenated by list, for a reasonably large data.table?
From my experiment, I see only a difference in performance, but I am not sure whether there is anything else.
# install.packages(c("data.table", "stringi"), dependencies = TRUE)
library(data.table)
library(stringi)
download.file("https://www.ssa.gov/oact/babynames/state/namesbystate.zip", destfile = "namesbystate.zip", mode = "wb")
unzip("namesbystate.zip", exdir=".")
# Collect the names of all per-state text files in "filelist"
filelist = list.files(path = ".", pattern = "\\.TXT$")
colnamelist = c("State", "gender", "year", "name", "frequency")
# Read each CSV-formatted text file into a data.frame
babynames = lapply(filelist, FUN = read.csv, header = FALSE, col.names = colnamelist)
nametable = rbindlist(babynames,use.names = FALSE,fill = FALSE)
DT = data.table(nametable)
dim(DT) #[1] 5647426 5
setkey(DT,NULL)
system.time(head(DT[,( stri_length(name)),by=c("name", "year")]))
# user system elapsed
# 156.47 0.03 157.64
setkey(DT,year)
system.time(head(DT[,( stri_length(name)),by=name]))
# user system elapsed
# 8.90 0.00 8.99
The output is identical in both cases:
      name year V1
1:    Anna 1910  4
2:   Annie 1910  5
3: Dorothy 1910  7
4:   Elsie 1910  5
5:   Helen 1910  5
6:    Lucy 1910  4
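
To check whether the key changes anything besides speed, a smaller self-contained comparison might look like the sketch below. The table toy and its values are made up for illustration and are not the baby-names data. It assumes only that setkey() sorts the table by the key column, so the groups may come back in a different order and both results are sorted before comparing.

```r
library(data.table)

# Hypothetical synthetic data standing in for the baby-names table
set.seed(1)
toy <- data.table(
  name = sample(c("Anna", "Annie", "Dorothy", "Elsie"), 1000, replace = TRUE),
  year = sample(1910:1915, 1000, replace = TRUE)
)

# Ad hoc grouping on an unkeyed table
unkeyed <- toy[, .(len = nchar(name[1])), by = .(name, year)]

# Same grouping after setting a key; setkey() reorders the rows by year
keyed_toy <- copy(toy)
setkey(keyed_toy, year)
keyed <- keyed_toy[, .(len = nchar(name[1])), by = .(name, year)]

# Groups are returned in order of first appearance, which differs once the
# rows have been reordered, so sort both results before comparing them
setorder(unkeyed, name, year)
setorder(keyed, name, year)
same_groups <- isTRUE(all.equal(unkeyed, keyed, check.attributes = FALSE))
print(same_groups)
```

If same_groups comes out TRUE, the grouped values themselves do not depend on the key in this toy case, and the only differences left to explain are the row order and the timing.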