Time to read 121401 csvs into a single data.table. Each timing is the average of three runs, rounded. Each csv has 3 columns, one header row, and, on average, 4.510 rows. The machine is a GCP VM with 96 cores.
rbindlist lapply read.delim  500s (example below)
rbindlist lapply fread       250s (example at end)
rbindlist mclapply fread      10s (example at end)
fread cat                      5s <- fastest; main topic of this answer
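For reference, the 500s baseline looks like this (a minimal sketch; list.files here stands in for however you collect the paths, and read.delim defaults to tab-separated input, so sep=',' is passed for comma-separated files):
library(data.table)
files = list.files(pattern='\\.csv$')
x = rbindlist(lapply(files, read.delim, sep=','), use.names=T, fill=T)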
The fread cat solution uses the cat shell command to concatenate all of the files together, then uses data.table's fread to read the result:
x = fread(cmd='cat *.csv', header=F)
However, what if each csv has a header? Concatenating them all would mix many headers in with the data rows. The fix is to let awk keep only the first one:
x = fread(cmd="awk 'NR==1||FNR!=1' *.csv", header=T)
NR==1 passes the very first line read (the first file's header), while FNR!=1 passes every line that is not the first line of its own file, so exactly one header survives.
And what if you have so many files that expanding the *.csv shell glob exceeds the command-line length limit ("argument list too long")?
x = fread(cmd='find . -name "*.csv" | xargs cat', header=F)
And what if all files have a header AND there are too many files?
header = fread(cmd='find . -name "*.csv" | head -n1 | xargs head -n1', header=T)
x = fread(cmd='find . -name "*.csv" | xargs tail -q -n+2', header=F)
setnames(x, names(header))
Here header is a zero-row data.table whose only purpose is to carry the column names from the first file; names(header) hands those names to setnames.
And what if the resulting concatenated csv is too big for system memory? fread(cmd=...) writes the command's output to a temporary file before parsing it, so a RAM-backed temp directory can fill up (e.g., a /dev/shm out-of-space error). In that case, write the combined file to disk instead:
system('find . -name "*.csv" | xargs cat > combined.csv')
x = fread('combined.csv', header=F)
With headers?
system('find . -name "*.csv" | head -n1 | xargs head -n1 > combined.csv')
system('find . -name "*.csv" | xargs tail -q -n+2 >> combined.csv')
x = fread('combined.csv', header=T)
What if you don't want all .csv files in a directory, but rather a specific set of files, and they all have headers?
fread(text=paste0(system("xargs cat|awk 'NR==1||$1!=\"<column one name>\"'",input=paths,intern=T),collapse="\n"),header=T,sep="\t")
Here <column one name> is a placeholder for the name of your first column: except for the very first line, the awk filter drops every line whose first field equals that name, i.e., the repeated headers.
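For instance, paths could come from filtering a directory listing (a hypothetical naming pattern; adapt it to your own files):
paths = list.files('data', pattern='^part-.*\\.tsv$', full.names=T)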
Here is a function I use in production to handle several of these situations, but you should test it thoroughly with your own data before trusting it. To include a column for the inpath of each csv, use keep_inpath=T and explicitly specify the csv separator with the sep argument.
fread_many = function(files, header=T, keep_inpath=F, sep="auto", ...){
  if(length(files)==0) return()
  if(typeof(files)!='character') return()
  files = files[file.exists(files)]
  if(length(files)==0) return()
  # note 1: requires awk, not cat or tail, because some files have no final newline
  # note 2: parallel --xargs is 40% slower
  # note 3: reading to var is 15% slower and crashes R if the string is too long
  # note 4: shorter paths -> more paths per awk -> fewer awks -> measurably faster,
  #         so it is best to cd to the csv dir and use relative paths
  tmp = tempfile(fileext = ".csv")
  if(keep_inpath==T){
    stopifnot(sep!="auto") # a known separator is needed to join the inpath column
    if(header==T){
      # build the combined header: "inpath<sep>" followed by the first file's header
      system(paste0('/usr/bin/echo -ne inpath"',sep,'" > ',tmp))
      system(paste0('head -n1 ',files[1],' >> ',tmp))
      # prefix each data row with its source file; FNR>1 skips the per-file headers
      system(paste0("xargs awk -vsep='",sep,"' 'BEGIN{OFS=sep}{if(FNR>1)print FILENAME,$0}' >> ",tmp),input=files)
    } else {
      system(paste0("xargs awk -vsep='",sep,"' 'BEGIN{OFS=sep}{print FILENAME,$0}' > ",tmp),input=files)
    }
  } else {
    if(header==T){
      system(paste0('head -n1 ',files[1],' > ',tmp))
      system(paste0("xargs awk 'FNR>1' >> ",tmp),input=files)
    } else {
      system(paste0("xargs awk '1' > ",tmp),input=files)
    }
  }
  DT = fread(file=tmp,header=header,sep=sep,...)
  file.remove(tmp)
  DT
}
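Example usage, with a hypothetical directory (per note 4, cd into the csv directory first so the paths passed to xargs stay short):
setwd('/path/to/csvs') # hypothetical path
files = list.files(pattern='\\.csv$')
DT = fread_many(files, header=T, keep_inpath=T, sep=',')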
Caveat: all of my solutions that concatenate the csvs before reading them assume the files share a single separator. If not all of your csvs use the same delimiter, instead use rbindlist lapply fread, rbindlist mclapply fread, or fread cat in batches, where all csvs in a batch use the same separator. Examples:
# single core: rbindlist lapply fread
x = rbindlist(lapply(files, fread), use.names=T, fill=T)
# multicore: rbindlist mclapply fread (mclapply comes from the parallel package)
library(parallel)
x = rbindlist(mclapply(files, fread), use.names=T, fill=T)
# fread cat in batches
x = fread(cmd='cat format1/*.csv')
y = fread(cmd='cat format2/*.csv')
z = rbind(x,y,use.names=T,fill=T)
Here is the first place I heard about fread cat: https://stackoverflow.com/a/11433740/1563960