Suppose we have a folder containing a very large number of .csv files, all with the same columns but different data. What is the fastest way to read them all into one R data.frame/data.table?

A similar question was asked in 2012. If you are using a version of R from before data.table was released (August 2008), please visit that question instead.

webb

1 Answer

Time to read 121,401 csvs into a single data.table. Each time is the average of three runs, then rounded. Each csv has 3 columns, one header row, and, on average, 4.510 rows. The machine is a GCP VM with 96 cores.

rbindlist lapply read.delim  500s
rbindlist lapply fread       250s    (example at end)
rbindlist mclapply fread      10s    (example at end)
fread cat                      5s <- fastest; main topic of this answer
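
If you want to sanity-check numbers like these on your own data, here is a minimal timing sketch; the csvs/ directory and headerless files are assumptions for illustration, not part of the benchmark above.

library(data.table)
files = list.files("csvs", pattern="\\.csv$", full.names=TRUE)  # hypothetical directory of headerless csvs
t_lapply = system.time(x1 <- rbindlist(lapply(files, fread, header=FALSE)))[["elapsed"]]
t_cat    = system.time(x2 <- fread(cmd="cat csvs/*.csv", header=FALSE))[["elapsed"]]
c(rbindlist_lapply_fread=t_lapply, fread_cat=t_cat)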

The fread cat solution uses the cat shell command to concatenate all of the files together, then uses data.table's fread to read the result:

x = fread(cmd='cat *.csv', header=F)

However, what if each csv has a header? Concatenating them all would mix many header rows in with the data rows. The awk program below keeps the first line of the first file (NR==1) plus every line that is not the first line of its file (FNR!=1), so exactly one header survives:

x = fread(cmd="awk 'NR==1||FNR!=1' *.csv", header=T)

And what if you have so many files that the *.csv shell glob fails (e.g., with an "Argument list too long" error)?

x = fread(cmd='find . -name "*.csv" | xargs cat', header=F)
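
Note that xargs splits its input on whitespace, so if any of your file names contain spaces, a null-delimited variant along these lines should be safer:

x = fread(cmd='find . -name "*.csv" -print0 | xargs -0 cat', header=F)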

And what if all files have a header AND there are too many files? Read one header, then concatenate every file from its second line onward (tail -q suppresses the ==> file <== banners that tail would otherwise print when given multiple files):

header = fread(cmd='find . -name "*.csv" | head -n1 | xargs head -n1', header=T)
x = fread(cmd='find . -name "*.csv" | xargs tail -q -n+2', header=F)
setnames(x, names(header))

And what if the resulting concatenated csv is too big for memory-backed temporary storage (e.g., fread's cmd= fails with an out-of-space error on /dev/shm)? Write the concatenated output to a file on disk instead:

system('find . -name "*.csv" | xargs cat > combined.csv')
x = fread('combined.csv', header=F)

With headers?

system('find . -name "*.csv" | head -n1 | xargs head -n1 > combined.csv')
system('find . -name "*.csv" | xargs tail -q -n+2 >> combined.csv')
x = fread('combined.csv', header=T)

What if you don't want every .csv in a directory, but rather a specific set of files (here, a character vector paths), all of which have headers?

fread(text=paste0(system("xargs cat|awk 'NR==1||$1!=\"<column one name>\"'",input=paths,intern=T),collapse="\n"),header=T,sep="\t")
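
That one-liner is dense, so here is a more readable sketch of the same idea; it assumes paths is a character vector of file paths, the files are tab-separated, and "id" stands in for the (unspecified) name of the first column:

# keep the very first header line; drop later lines whose first field repeats it
lines = system("xargs cat | awk 'NR==1 || $1!=\"id\"'", input=paths, intern=TRUE)
x = fread(text=paste0(lines, collapse="\n"), header=TRUE, sep="\t")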

Here is a function I use in production to handle several of these situations, but you should test it thoroughly with your own data before trusting it. To include a column for the inpath of each csv, use keep_inpath=T and explicitly specify the csv separator with the sep argument.

fread_many = function(files,header=T,keep_inpath=F,sep="auto",...){
  if(length(files)==0) return()
  if(typeof(files)!='character') return()
  files = files[file.exists(files)]
  if(length(files)==0) return()
  # note 1: requires awk, not cat or tail because some files have no final newline
  # note 2: parallel --xargs is 40% slower
  # note 3: reading to var is 15% slower and crashes R if the string is too long
  # note 4: shorter paths -> more paths per awk -> fewer awks -> measurably faster
  #         so best cd to the csv dir and use relative paths
  tmp = tempfile(fileext = ".csv")
  if(keep_inpath==T){
    stopifnot(sep!="auto")
    if(header==T){
      system(paste0('/usr/bin/echo -ne inpath"',sep,'" > ',tmp))
      system(paste0('head -n1 ',files[1],' >> ',tmp))
      system(paste0("xargs awk -vsep='",sep,"' 'BEGIN{OFS=sep}{if(FNR>1)print FILENAME,$0}' >> ",tmp),input=files)
    } else {
      system(paste0("xargs awk -vsep='",sep,"' 'BEGIN{OFS=sep}{print FILENAME,$0}' > ",tmp),input=files)
    }
  } else {
    if(header==T){
      system(paste0('head -n1 ',files[1],' > ',tmp))
      system(paste0("xargs awk 'FNR>1' >> ",tmp),input=files)
    } else {
      system(paste0("xargs awk '1' > ",tmp),input=files)
    }
  }
  DT = fread(file=tmp,header=header,sep=sep,...)
  file.remove(tmp)
  DT
}
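
A minimal usage sketch (the data/ directory and the tab separator are assumptions for illustration):

library(data.table)
files = list.files("data", pattern="\\.csv$", full.names=TRUE)
x = fread_many(files, header=T, keep_inpath=T, sep="\t")  # keep_inpath=T requires an explicit sep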

Caveat: all of my solutions that concatenate the csvs before reading them assume that every file uses the same separator. If your csvs don't all use the same delimiter, instead use rbindlist lapply fread, rbindlist mclapply fread, or fread cat in batches, where all csvs in a batch share a separator. Examples:

# single core: rbindlist lapply fread
x = rbindlist(lapply(files, fread), use.names=T, fill=T)

# multicore: rbindlist mclapply fread (mclapply is from the parallel package)
library(parallel)
x = rbindlist(mclapply(files, fread), use.names=T, fill=T)

# fread cat in batches
x = fread(cmd='cat format1/*.csv')
y = fread(cmd='cat format2/*.csv')
z = rbind(x,y,use.names=T,fill=T)

Here is the first place I heard about fread cat: https://stackoverflow.com/a/11433740/1563960

webb
  • What if you setDTthreads(1) to reduce overhead? Very good answer. Use setnames(x,) instead of names(x)<- – jangorecki Jul 14 '22 at 08:19
  • Thanks! Changed to use setnames(). Have you found setDTthreads(1) during reading saves time? I thought multicore made fread() faster even on one large file. – webb Sep 03 '22 at 07:12
  • If file is small it can actually take more time to parallelize its reading than reading itself. These are usually microseconds but for thousands of files can add up. Thus important to check on your exact use case. – jangorecki Sep 03 '22 at 10:03