
There is a lot of documentation on how to read multiple CSVs and bind them into one data frame. I have 5000+ CSV files I need to read in and bind into one data structure.

In particular I've followed the discussion here: Issue in Loading multiple .csv files into single dataframe in R using rbind

The weird thing is that base R is much faster than any other solution I've tried.

Here's what my CSV looks like:

> head(PT)
  Line          Timestamp       Lane.01 Lane.02 Lane.03 Lane.04 Lane.05 Lane.06 Lane.07 Lane.08
1    PL1    05-Jan-16 07:17:36      NA      NA      NA      NA      NA      NA      NA      NA
2    PL1    05-Jan-16 07:22:38      NA      NA      NA      NA      NA      NA      NA      NA
3    PL1    05-Jan-16 07:27:41      NA      NA      NA      NA      NA      NA      NA      NA
4    PL1    05-Jan-16 07:32:43    9.98   10.36   10.41   10.16   10.10    9.97   10.07    9.59
5    PL1    05-Jan-16 07:37:45    9.65    8.87    9.88    9.86    8.85    8.75    9.19    8.51
6    PL1    05-Jan-16 07:42:47    9.14    8.98    9.29    9.04    9.01    9.06    9.12    9.08

I've created three methods for reading in and binding the data. The files are located in a separate directory which I define as:

dataPath <- "data"
PTfiles <- list.files(path=dataPath, full.names = TRUE)

Method 1: Base R

classes <- c("factor", "character", rep("numeric",8))

# build function to load data
load_data <- function(dataPath, classes) { 
   tables <- lapply(PTfiles, read.csv, colClasses=classes, na.strings=c("NA", ""))
   do.call(rbind, tables)
}

#clock
method1 <- system.time(
   PT <- load_data(dataPath, classes)
)

Method 2: readr::read_csv. In this case I created a wrapper function for read_csv to use:

library(readr)

# create wrapper function for read_csv
read_csv.wrap <- function(x) { read_csv(x, skip = 1, na = c("NA", ""),
                      col_names = c("tool", "timestamp", paste("lane", 1:8, sep="")),
                      col_types = 
                         cols(
                            tool = col_character(),
                            timestamp = col_character(),
                            lane1 = col_double(),
                            lane2 = col_double(),
                            lane3 = col_double(),
                            lane4 = col_double(),
                            lane5 = col_double(),
                            lane6 = col_double(),
                            lane7 = col_double(),
                            lane8 = col_double()
                           )
                     )
}

# Same as method 1, just uses read_csv instead of read.csv

load_data2 <- function(dataPath) { 
   tables <- lapply(PTfiles, read_csv.wrap)
   do.call(rbind, tables)
}

#clock
method2 <- system.time(
   PT2 <- load_data2(dataPath)
)

Method 3: read_csv + dplyr::bind_rows

load_data3 <- function(dataPath) { 
   tables <- lapply(PTfiles, read_csv.wrap)
   dplyr::bind_rows(tables)
}

#clock
method3 <- system.time(
   PT3 <- load_data3(dataPath)
)

What I can't figure out is why the read_csv and dplyr methods are slower in elapsed time when they should be faster. The CPU time is lower, but why would the elapsed time (file system) increase? What's going on here?

Edit - I added the data.table method as suggested in the comments

Method 4: data.table

library(data.table)

load_data4 <- function(dataPath){
   tables <- lapply(PTfiles, fread)
   rbindlist(tables)
}

method4 <- system.time(
   PT4 <- load_data4(dataPath)
)

The data.table method is the fastest from a CPU standpoint. But the question still stands: what is going on with the read_csv methods that makes them so slow?

> rbind(method1, method2, method3, method4)
        user.self sys.self elapsed
method1      0.56     0.39    1.35
method2      0.42     1.98   13.96
method3      0.36     2.25   14.69
method4      0.34     0.67    1.74
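
To narrow down where the elapsed time goes, here is a minimal sketch (assuming the same PTfiles and read_csv.wrap defined above) that times the read step and the bind step separately:

# Time the read step and the bind step separately
# (assumes PTfiles and read_csv.wrap as defined above)
read_time <- system.time(
   tables <- lapply(PTfiles, read_csv.wrap)
)
bind_time <- system.time(
   PT2b <- dplyr::bind_rows(tables)   # bound result, same as PT3 above
)
rbind(read_time, bind_time)

If the difference shows up in the read step alone, it should also reproduce when reading one large CSV, without any of the lapply/binding code.
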
  • I think `data.table::fread()` is the fastest one – cirofdo May 17 '17 at 19:46
  • Followed by `data.table::rbindlist()` to glue them together. – Dirk Eddelbuettel May 17 '17 at 19:46
  • I added the `data.table` method to the question, which is fast. My question still stands as to why the `read_csv` method is so slow. – Lloyd Christmas May 17 '17 at 20:06
  • If you can nail this down into a very concrete, reproducible example with a single file that you're willing to share, you might find it more productive to raise this as an issue on github. If it's reproducible, Hadley would likely be very interested in a performance gap of that magnitude with base R. – joran May 17 '17 at 20:31
  • I will try to reframe more concretely. It may be hard with a single file though, to do this test I used a directory with 140 CSV files. With a single file the time would basically be zero. Should I delete this and repost a new question? – Lloyd Christmas May 17 '17 at 20:34
  • Well, I just think your investigation here is really only partially started. I'd spend some time profiling your options to see where time is really being spent in each case. If the time is really being sunk in `read_csv` you ought to be able to reproduce the speed difference in reading one giant csv without all the lapply and binding stuff. – joran May 17 '17 at 20:41
  • But without your files, etc, it's hard for anyone else to do that for you or help, if you know what I mean... – joran May 17 '17 at 20:42
  • The binding piece is also important, as it could impact computation time. I'll spend some time with the profiler as you suggest and reframe the question in a more reproducible way. – Lloyd Christmas May 17 '17 at 20:57
  • Have you tried specifying column types? Another useful way to do this I saw recently is querying local Apache Drill via [sergeant](https://github.com/hrbrmstr/sergeant) (see [here](https://twitter.com/hrbrmstr/status/863031565051465728)). Getting it installed is more work than the above, but it's powerful once it is. – alistaire May 17 '17 at 21:42
  • Couple of points: as others have said, it's difficult to replicate without at least a sample of the files you are reading. It would be good practice to pass PTfiles as a parameter in your function(s) rather than reading it from the global env. Instead of lapply, have a look at [furrr](https://github.com/DavisVaughan/furrr), which should work well with so many files (and is independent of which read function you end up using!) – Moohan Aug 12 '19 at 15:16
  • Look at this answer: [Comparing speed of fread vs. read.table for reading the first 1M rows out of 100M](https://stackoverflow.com/a/32263765/1250706). – fmassica Oct 08 '19 at 17:58
  • Perhaps try using map_dfr() with read_csv. This seems to be the most intuitive way to read and iterate with the tidyverse. e.g. `map_dfr(list.files(), read_csv)` – thus__ Feb 10 '20 at 01:42

2 Answers


I would do this in the terminal (Unix). Put all the files in the same folder, navigate to that folder in the terminal, and then use the following command to create a single CSV file:

cat *.csv > merged_csv_file.csv

One observation regarding this method is that the header of each file will show up in the middle of the observations. To solve this I would suggest you do:

Get just the header from the first file

head -2 file1.csv > merged_csv_file.csv

Then skip the first "X" lines from the files with the following command, where "X" is the number of lines to skip:

tail -n +3 -q file*.csv >> merged_csv_file.csv

-n +3 makes tail print from the 3rd line to the end, -q tells it not to print a header with the file name (see man tail), and >> appends to the file rather than overwriting it as > would.
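
As a possible follow-up (not part of the merge itself), the single merged file could then be read back into R in one call, for example with data.table::fread as in Method 4 of the question:

library(data.table)

# Read the merged file produced by the shell commands above
# (assumes merged_csv_file.csv is in the working directory)
PT_merged <- fread("merged_csv_file.csv", na.strings = c("NA", ""))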

Marcio Rodrigues
  • I'd be very careful merging files without considering the header, because you never know if the odd file doesn't have a differently composed header. A safer command-line tool that does consider headers would be `xsv cat`: https://github.com/BurntSushi/xsv – dlaehnemann Aug 18 '22 at 11:21

I might have found a related issue. I am reading in nested CSV data from some simulation output, where multiple columns have CSV formatted data as elements, which I need to unnest and reshape for analysis.

With simulations where I have many runs, this resulted in thousands of elements that needed to be parsed. Using map(., read_csv), this would take hours to transform. When I rewrote my script to apply read.csv in a lambda function, the operation completed in seconds.

I'm curious if there is some intermediate system I/O operation or error handling that creates a bottleneck you wouldn't run into with a single input file.
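
Here is a minimal sketch of the kind of swap described above; nested_strings is a made-up stand-in for a column whose elements are CSV-formatted text:

library(purrr)
library(readr)

# Hypothetical nested CSV elements, for illustration only
nested_strings <- c("a,b\n1,2\n3,4\n", "a,b\n5,6\n")

# readr route (the path described above as slow at scale)
parsed_readr <- map(nested_strings, ~ read_csv(I(.x), show_col_types = FALSE))

# base-R route via a lambda (the path that completed in seconds)
parsed_base <- map(nested_strings, function(s) read.csv(text = s))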

jlamb