42

I have several different txt files with the same structure. Now I want to read them into R using fread, and then union them into a bigger dataset.

## First put all file names into a list 
library(data.table)
all.files <- list.files(path = "C:/Users",pattern = ".txt")

## Read data using fread
readdata <- function(fn){
    dt_temp <- fread(fn, sep=",")
    keycols <- c("ID", "date")
    setkeyv(dt_temp,keycols)  # Notice there's a "v" after setkey with multiple keys
    return(dt_temp)

}
# then using 
mylist <- lapply(all.files, readdata)
mydata <- do.call('rbind',mylist)

The code works fine, but the speed is not satisfactory. Each txt file has 1M observations and 12 fields.

If I use the fread to read a single file, it's fast. But using apply, then speed is extremely slow, and obviously take much time than reading files one by one. I wonder where went wrong here, is there any improvements for the speed gain?

I tried the llply in plyr package, there're not much speed gains.

Also, is there any syntax in data.table to achieve vertical join like rbind and union in sql?

Thanks.

Arun
  • 116,683
  • 26
  • 284
  • 387
Bigchao
  • 1,746
  • 3
  • 15
  • 31

2 Answers2

61

Use rbindlist() which is designed to rbind a list of data.table's together...

mylist <- lapply(all.files, readdata)
mydata <- rbindlist( mylist )

And as @Roland says, do not set the key in each iteration of your function!

So in summary, this is best :

l <- lapply(all.files, fread, sep=",")
dt <- rbindlist( l )
setkey( dt , ID, date )
Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
Simon O'Hanlon
  • 58,647
  • 14
  • 142
  • 184
  • 4
    Also, set they key only once at the end. – Roland Jan 16 '14 at 08:17
  • @SimonO'Hanlon, thanks a lot. Is the for loop faster than lapply? – Bigchao Jan 16 '14 at 13:22
  • @Bigchao indeterminate. But if you think about it, what do you expect 99.999% of your processing time to be? The computational overhead of either `for` or `lapply` or reading in `1e6` observations of data? It's totally arbitrary in this case. I *think* that memory management might be better using the `for` loop, and it certainly isn't worse than `lapply`. There will be no speed difference between the two. – Simon O'Hanlon Jan 16 '14 at 13:24
  • 1
    If you are calling files outside of your working directory, be sure to add `full.names = TRUE` to `list.files()` e.g. `list.files(path = "C:/Users",pattern = ".txt",full.names=TRUE)`. This will attach the full file path to each called file, allowing the `lapply` function to successfully locate and operate on each file. – TheSciGuy May 02 '19 at 15:58
  • This is often still too slow. Here are some 25-50x faster ways https://stackoverflow.com/a/58131427/1563960 – webb Aug 02 '21 at 15:46
1

I've re-written the code to do this way too many times.. Finally rolled it into a handy function, below.

data.table_fread_mult <- function(filepaths = NULL, dir = NULL, recursive = FALSE, pattern = NULL, fileCol = FALSE, ...){
  # fread multiple filepaths and then combine the results into a single data.table
  # This function has two interfaces: either
  # 1) provide `filepaths` as a character vector of filepaths to read or 
  # 2) provide `dir` (and optionally `pattern` and `recursive`) to identify the directory to read from
  # If fileCol = TRUE, result will incude a column called File with the full source file path of each record
  # ... should be arguments to pass on to fread()
  # `pattern` is an optional regular expression to match files (e.g. pattern='csv$' matches files ending with 'csv')
  
  if(!is.null(filepaths) & (!is.null(dir) | !is.null(pattern))){
    stop("If `filepaths` is given, `dir` and `pattern` should be NULL")
  } else if(is.null(filepaths) & is.null(dir)){
    stop("If `filepaths` is not given, `dir` should be given")
  }
  
  # If filepaths isn't given, build it from dir, recursive, pattern
  if(is.null(filepaths)){
    filepaths <- list.files(
      path = dir, 
      full.names = TRUE, 
      recursive = recursive, 
      pattern = pattern
    )
  }
  
  # Read and combine files
  if(fileCol){
    return(rbindlist(lapply(filepaths, function(x) fread(x, ...)[, File := x]), use.names = TRUE))
  } else{
    return(rbindlist(lapply(filepaths, fread, ...), use.names = TRUE))
  }
}
Ben
  • 20,038
  • 30
  • 112
  • 189
  • 1
    How can I add a column with file names using this function? For example, if I have `sample1.txt sample2.txt sample3.txt` in my dir, I would like to read and merge them to one data table with V2 as filename (e.g. sample1). So my data would look like `sample1scontent sample1` – Isin Altinkaya Jul 19 '21 at 11:55
  • 1
    This is often still too slow. Here are some 25-50x faster ways https://stackoverflow.com/a/58131427/1563960 – webb Aug 02 '21 at 15:46