323

Suppose we have a folder containing multiple data.csv files, each with the same variables but recorded at different times. Is there a way in R to import them all simultaneously rather than having to import each one individually?

My problem is that I have around 2000 data files to import, and importing them individually with:

read.delim(file="filename", header=TRUE, sep="\t")

is not very efficient.

Jojo
  • 4
    I hope this helps someone: There are orders of magnitude speed differences between these answers. Accepted answer (2012): base R; extremely slow. #2 (2016): 4 page research paper describing and testing base, tidyverse, and data.table methods; the last is 4x as fast. #3 (2014): succinct answer showing base, tidyverse, and data.table methods. #4 (me, 2019) expands on Spacedman for shell newbs / adds common cases e.g. csvs with headers. #5 (Spacedman, 2012): bash/shell/unix way; hundreds of times faster for many files. #6 (2014) maybe relevant for Windows users. – webb Jun 29 '22 at 16:50

16 Answers

376

Something like the following should result in each data frame as a separate element in a single list:

temp = list.files(pattern="*.csv")
myfiles = lapply(temp, read.delim)

This assumes that you have those CSVs in a single directory--your current working directory--and that all of them have the lower-case extension .csv.

If you then want to combine those data frames into a single data frame, see the solutions in other answers using things like do.call(rbind,...), dplyr::bind_rows() or data.table::rbindlist().
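For example, a minimal sketch of that combining step, assuming myfiles is the list built above and all files share the same columns:

# any one of these collapses the list into a single table
combined <- do.call(rbind, myfiles)
combined <- dplyr::bind_rows(myfiles)
combined <- data.table::rbindlist(myfiles)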

If you really want each data frame in a separate object, even though that's often inadvisable, you could do the following with assign:

temp = list.files(pattern="*.csv")
for (i in 1:length(temp)) assign(temp[i], read.csv(temp[i]))

Or, without assign, and to demonstrate (1) how the file name can be cleaned up and (2) how to use list2env, you can try the following:

temp = list.files(pattern="*.csv")
list2env(
  lapply(setNames(temp, make.names(gsub("\\.csv$", "", temp))), 
         read.csv), envir = .GlobalEnv)

But again, it's often better to leave them in a single list.
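If you do keep everything in one list, naming the elements makes it easy to pull out a single data frame or to run the same operation over all of them (a small illustration, not part of the original answer; the element name is hypothetical):

temp <- list.files(pattern = "*.csv")
myfiles <- setNames(lapply(temp, read.delim),
                    make.names(gsub("\\.csv$", "", temp)))

myfiles[["dive_0904_003"]]   # access one data frame by its (hypothetical) name
lapply(myfiles, summary)     # or summarise every data frame at once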

A5C1D2H2I1M1N2O1R2T1
  • 1
    Thanks! this works very well...how would I go about naming each file I have just imported so I can easily call them up? – Jojo Jul 11 '12 at 13:34
  • if you can show us the first few lines of some of your files we might have some suggestions - edit your question for that! – Spacedman Jul 11 '12 at 14:13
  • 2
The above code works perfectly for importing them as single objects but when I try to call up a column from the data set it doesn't recognise it as it is only a single object not a data frame i.e. my version of the above code is: setwd('C:/Users/new/Desktop/Dives/0904_003') temp<-list.files(pattern="*.csv") ddives <- lapply(temp, read.csv) So now each file is called ddives[n] but how would I go about writing a loop to make them all data frames rather than single objects? I can achieve this individually using the data.frame operator but am unsure as to how to loop this. @mrdwab – Jojo Jul 11 '12 at 15:07
  • @JosephOnoufriou, see my update. But generally, I find working with lists easier if I'm going to be doing similar calculations on all data frames. – A5C1D2H2I1M1N2O1R2T1 Jul 11 '12 at 15:56
  • 2
    For anyone trying to write a function to do the updated version of this answer using `assign`... If you want the assigned values to reside in the global environment, make sure you set `inherits=T`. – dnlbrky Apr 30 '14 at 02:06
  • I'm getting an error with the `"*.csv"` pattern: `invalid 'pattern' regular expression`. To fix this, I use the following pattern: `pattern=".*\\.csv"`. The error was caused, I think, by the leading asterisk, which (on my machine at least, with R 2.3.1) needs a character / symbol to precede it. The pattern I used works as follows: first, it starts with `.`, denoting any character. The `*` then repeats the preceding `.` as needed. Lastly, the pattern ends with a `.csv` literal, where the `.` in `.csv` is escaped by the double backslash `\\`. – william_grisaitis Jan 19 '15 at 20:12
  • This is an excellent solution that I'm currently using. If I can take the liberty of suggesting a way to clean the file names I would provide `df_name <- sub("\\.[[:alnum:]]+$", "", basename(as.character(files[i])))` and then pass the `df_name` in the loop instead of the `files[i]`, on the lines: `assign(df_name, read.csv(files[i]))` which works like a charm. – Konrad Jun 05 '15 at 09:02
  • Great solutions.. Thanks . I am facing issue while reading "myfiles" variable data.. it showing error like "Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 256992, 281250, 143140, 1220" – Niks Dec 05 '17 at 05:27
  • Apologies for the heavy edits, but I've grown tired of people finding this question and not being able to get past this first answer to see the other options (combining into a single data frame, avoiding assign). – joran Aug 03 '18 at 19:33
  • It might be worth noting that the example pattern `"*.csv"` looks at first glance like an OS wildcard. It is, in fact, a regular expression, so if you're intending something like `"july-*.csv"` to match `july-2.csv` and `july-31.csv`, the pattern would look like `pattern="july-[0-9]+.csv"` instead. – Rakurai Aug 14 '18 at 13:01
Hello, how can I make this work? `"temp = list.files(path ="C:/Users/luisf/Dropbox/etc/e tc/ e t c /", pattern="*.xlsx") list2env( lapply(setNames(temp, make.names(gsub("*.xlsx$", "", temp))), read_xlsx), envir = .GlobalEnv)"` . I'm receiving `Error: `path` does not exist: ` – Luis Aug 05 '21 at 14:25
  • @Luis, Just guessing right now, but maybe you need to [normalize the paths](https://stat.ethz.ch/R-manual/R-devel/library/base/html/normalizePath.html) first. – A5C1D2H2I1M1N2O1R2T1 Aug 06 '21 at 18:10
  • @A5C1D2H2I1M1N2O1R2T1, thanks! I received a lot of insightful answers here. In case you (or other) want to check: https://stackoverflow.com/questions/68669731/looping-through-the-global-environment-and-rename-the-dataframes-to-lowercase-wi/68670020#68670020 – Luis Aug 07 '21 at 19:04
  • Hi, i use the first command but added a path in the first row: `list.files (path = "...", pattern = "*.csv")` The rest stayed the same. It creates temp with all the correct files. However, in the second row i get the error `Error in file(file, "rt") : cannot open the connection In addition: Warning message: In file(file, "rt") : cannot open file 'enaho_2004_M7_edited.csv': No such file or directory`. Since i am still quite new with R, i don't know where the problem is. Thank you! – VBo Dec 06 '22 at 13:43
243

A speedy and succinct tidyverse solution (more than twice as fast as base R's read.csv):

library(tidyverse) # provides read_csv(), map_df(), and the %>% pipe

tbl <-
    list.files(pattern = "*.csv") %>% 
    map_df(~read_csv(.))

and data.table's fread() can cut those load times in half again (roughly a quarter of the base R times):

library(data.table)

tbl_fread <- 
    list.files(pattern = "*.csv") %>% 
    map_df(~fread(.))

The stringsAsFactors = FALSE argument keeps the data frame factor-free (and, as marbel points out, it is the default setting for fread).

If the type guessing is being cheeky, you can force all the columns to be read as character with the col_types argument.

tbl <-
    list.files(pattern = "*.csv") %>% 
    map_df(~read_csv(., col_types = cols(.default = "c")))

If you want to dip into subdirectories to construct your list of files to eventually bind, then be sure to include the path name and register the files with their full names in your list. This allows the binding work to go on outside the current directory. (Think of the full pathnames as passports that allow movement back across directory 'borders'.)

tbl <-
    list.files(path = "./subdirectory/",
               pattern = "*.csv", 
               full.names = T) %>% 
    map_df(~read_csv(., col_types = cols(.default = "c"))) 

As Hadley describes here (about halfway down):

map_df(x, f) is effectively the same as do.call("rbind", lapply(x, f))....

Bonus feature - adding the original filename to each record, per Niks' feature request in the comments below.

Code explained: make a function to append the filename to each record during the initial reading of the tables. Then use that function instead of the simple read_csv() function.

read_plus <- function(flnm) {
    read_csv(flnm) %>% 
        mutate(filename = flnm)
}

tbl_with_sources <-
    list.files(pattern = "*.csv", 
               full.names = T) %>% 
    map_df(~read_plus(.))

(The typecasting and subdirectory handling approaches can also be handled inside the read_plus() function in the same manner as illustrated in the second and third variants suggested above.)
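For instance, a sketch of read_plus() with both the character-column forcing and the subdirectory handling folded in (my own combination of the variants above, not separately benchmarked):

read_plus <- function(flnm) {
    read_csv(flnm, col_types = cols(.default = "c")) %>% 
        mutate(filename = flnm)
}

tbl_with_sources <-
    list.files(path = "./subdirectory/",
               pattern = "*.csv",
               full.names = TRUE) %>% 
    map_df(~read_plus(.))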

### Benchmark Code & Results 
library(tidyverse)
library(data.table)
library(microbenchmark)

### Base R Approaches
#### Instead of a dataframe, this approach creates a list of lists
#### removed from analysis as this alone doubled analysis time reqd
# lapply_read.delim <- function(path, pattern = "*.csv") {
#     temp = list.files(path, pattern, full.names = TRUE)
#     myfiles = lapply(temp, read.delim)
# }

#### `read.csv()`
do.call_rbind_read.csv <- function(path, pattern = "*.csv") {
    files = list.files(path, pattern, full.names = TRUE)
    do.call(rbind, lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE)))
}

map_df_read.csv <- function(path, pattern = "*.csv") {
    list.files(path, pattern, full.names = TRUE) %>% 
    map_df(~read.csv(., stringsAsFactors = FALSE))
}


### *dplyr()*
#### `read_csv()`
lapply_read_csv_bind_rows <- function(path, pattern = "*.csv") {
    files = list.files(path, pattern, full.names = TRUE)
    lapply(files, read_csv) %>% bind_rows()
}

map_df_read_csv <- function(path, pattern = "*.csv") {
    list.files(path, pattern, full.names = TRUE) %>% 
    map_df(~read_csv(., col_types = cols(.default = "c")))
}

### *data.table* / *purrr* hybrid
map_df_fread <- function(path, pattern = "*.csv") {
    list.files(path, pattern, full.names = TRUE) %>% 
    map_df(~fread(.))
}

### *data.table*
rbindlist_fread <- function(path, pattern = "*.csv") {
    files = list.files(path, pattern, full.names = TRUE)
    rbindlist(lapply(files, function(x) fread(x)))
}

do.call_rbind_fread <- function(path, pattern = "*.csv") {
    files = list.files(path, pattern, full.names = TRUE)
    do.call(rbind, lapply(files, function(x) fread(x, stringsAsFactors = FALSE)))
}


read_results <- function(dir_size){
    microbenchmark(
        # lapply_read.delim = lapply_read.delim(dir_size), # too slow to include in benchmarks
        do.call_rbind_read.csv = do.call_rbind_read.csv(dir_size),
        map_df_read.csv = map_df_read.csv(dir_size),
        lapply_read_csv_bind_rows = lapply_read_csv_bind_rows(dir_size),
        map_df_read_csv = map_df_read_csv(dir_size),
        rbindlist_fread = rbindlist_fread(dir_size),
        do.call_rbind_fread = do.call_rbind_fread(dir_size),
        map_df_fread = map_df_fread(dir_size),
        times = 10L) 
}

read_results_lrg_mid_mid <- read_results('./testFolder/500MB_12.5MB_40files')
print(read_results_lrg_mid_mid, digits = 3)

read_results_sml_mic_mny <- read_results('./testFolder/5MB_5KB_1000files/')
read_results_sml_tny_mod <- read_results('./testFolder/5MB_50KB_100files/')
read_results_sml_sml_few <- read_results('./testFolder/5MB_500KB_10files/')

read_results_med_sml_mny <- read_results('./testFolder/50MB_50KB_1000files')
read_results_med_sml_mod <- read_results('./testFolder/50MB_500KB_100files')
read_results_med_med_few <- read_results('./testFolder/50MB_5MB_10files')

read_results_lrg_sml_mny <- read_results('./testFolder/500MB_500KB_1000files')
read_results_lrg_med_mod <- read_results('./testFolder/500MB_5MB_100files')
read_results_lrg_lrg_few <- read_results('./testFolder/500MB_50MB_10files')

read_results_xlg_lrg_mod <- read_results('./testFolder/5000MB_50MB_100files')


print(read_results_sml_mic_mny, digits = 3)
print(read_results_sml_tny_mod, digits = 3)
print(read_results_sml_sml_few, digits = 3)

print(read_results_med_sml_mny, digits = 3)
print(read_results_med_sml_mod, digits = 3)
print(read_results_med_med_few, digits = 3)

print(read_results_lrg_sml_mny, digits = 3)
print(read_results_lrg_med_mod, digits = 3)
print(read_results_lrg_lrg_few, digits = 3)

print(read_results_xlg_lrg_mod, digits = 3)

# display boxplot of my typical use case results & basic machine max load
par(oma = c(0,0,0,0)) # remove overall margins if present
par(mfcol = c(1,1)) # remove grid if present
par(mar = c(12,5,1,1) + 0.1) # to display just a single boxplot with its complete labels
boxplot(read_results_lrg_mid_mid, las = 2, xlab = "", ylab = "Duration (seconds)", main = "40 files @ 12.5MB (500MB)")
boxplot(read_results_xlg_lrg_mod, las = 2, xlab = "", ylab = "Duration (seconds)", main = "100 files @ 50MB (5GB)")

# generate 3x3 grid boxplots
par(oma = c(12,1,1,1)) # margins for the whole 3 x 3 grid plot
par(mfcol = c(3,3)) # create grid (filling down each column)
par(mar = c(1,4,2,1)) # margins for the individual plots in 3 x 3 grid
boxplot(read_results_sml_mic_mny, las = 2, xlab = "", ylab = "Duration (seconds)", main = "1000 files @ 5KB (5MB)", xaxt = 'n')
boxplot(read_results_sml_tny_mod, las = 2, xlab = "", ylab = "Duration (milliseconds)", main = "100 files @ 50KB (5MB)", xaxt = 'n')
boxplot(read_results_sml_sml_few, las = 2, xlab = "", ylab = "Duration (milliseconds)", main = "10 files @ 500KB (5MB)",)

boxplot(read_results_med_sml_mny, las = 2, xlab = "", ylab = "Duration (microseconds)        ", main = "1000 files @ 50KB (50MB)", xaxt = 'n')
boxplot(read_results_med_sml_mod, las = 2, xlab = "", ylab = "Duration (microseconds)", main = "100 files @ 500KB (50MB)", xaxt = 'n')
boxplot(read_results_med_med_few, las = 2, xlab = "", ylab = "Duration (seconds)", main = "10 files @ 5MB (50MB)")

boxplot(read_results_lrg_sml_mny, las = 2, xlab = "", ylab = "Duration (seconds)", main = "1000 files @ 500KB (500MB)", xaxt = 'n')
boxplot(read_results_lrg_med_mod, las = 2, xlab = "", ylab = "Duration (seconds)", main = "100 files @ 5MB (500MB)", xaxt = 'n')
boxplot(read_results_lrg_lrg_few, las = 2, xlab = "", ylab = "Duration (seconds)", main = "10 files @ 50MB (500MB)")

Middling Use Case

[Figure: boxplot comparison of elapsed time for my typical use case]

Larger Use Case

[Figure: boxplot comparison of elapsed time for the extra-large load]

Variety of Use Cases

Rows: file counts (1000, 100, 10)
Columns: final dataframe size (5MB, 50MB, 500MB)

[Figure: boxplot comparison across the directory size variations]

The base R results are better for the smallest use cases, where the overhead of bringing the C libraries of purrr and dplyr to bear outweighs the performance gains seen in larger-scale processing tasks.

If you want to run your own tests, you may find this bash script helpful.

#!/bin/bash
# Usage: bash what_you_name_this_script.sh <file_to_copy> <number_of_copies>
for ((i=1; i<=$2; i++)); do 
  cp "$1" "${1:0:8}_${i}.csv"
done

Running bash what_you_name_this_script.sh "fileName_you_want_copied" 100 will create 100 sequentially numbered copies of your file (named after the initial 8 characters of the filename plus an underscore).
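If you would rather generate the test copies from within R, here is a rough equivalent using only base R (make_copies is a name I've made up):

# make n sequentially numbered copies of a file, mirroring the bash script above
make_copies <- function(file_to_copy, n) {
  stem <- substr(file_to_copy, 1, 8)
  for (i in seq_len(n)) file.copy(file_to_copy, paste0(stem, "_", i, ".csv"))
}

make_copies("fileName_you_want_copied", 100)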

Attributions and Appreciations

With special thanks to:

  • Tyler Rinker and Akrun for demonstrating microbenchmark.
  • Jake Kaupp for introducing me to map_df() here.
  • David McLaughlin for helpful feedback on improving the visualizations and discussing/confirming the performance inversions observed in the small file, small dataframe analysis results.
  • marbel for pointing out the default behavior for fread(). (I need to study up on data.table.)
leerssej
  • 2
Your solution works for me. I also want to store the file name to differentiate the records. Is that possible? – Niks Dec 05 '17 at 06:19
  • 2
    @Niks - Certainly! Just write and swap in a little function that not only reads the files but immediately appends a filename to each record read. Like so `readAddFilename <- function(flnm) { read_csv(flnm) %>% mutate(filename = flnm) }` Then just drop that in to the `map_df` instead of the simple read only `read_csv()` that is there now. I can update the entry above to show the function and how it would fit into the pipe if you still have questions or you think that will be helpful. – leerssej Dec 10 '17 at 01:04
  • 1
The problem in practice is that `read_csv` is much slower than `fread`. I would include a benchmark if you are going to say something is faster. One idea is creating 30 1GB files and reading them; that would be a case where performance matters. – marbel Mar 16 '18 at 03:26
  • @marbel: Thank you for the suggestion! On 530 MB and smaller directories (with up to 100 files) I am finding a 25% improvement in performance between *data.table*'s `fread()` and *dplyr*'s `read_csv()`: 14.2 vs 19.9 secs. TBH, I had only been comparing base R to dplyr and as `read_csv()` is around 2-4x faster than the `read.csv()`, benchmarking didn't seem necessary. It has however been interesting to give `fread()` a whirl and pause to check out more complete benchmark results. Thanks again! – leerssej Apr 24 '18 at 20:04
In my experience `fread` just works. I never needed anything else. – marbel Apr 27 '18 at 05:02
  • It would be interesting to see how my answer using `rio::import_list` compares. –  May 09 '18 at 10:31
  • @leerssej my proposal was to create x amount of files where each has 1 GB. 500MB is tiny data for today standards. But conceptually it's not surprising that `rbindlist` + `fread` is among the faster options. – marbel Nov 14 '18 at 03:02
  • Instead of `read_plus` you could also set the names of the filename list, then use the `.id` column in `map_df`: https://github.com/STAT545-UBC/Discussion/issues/398#issuecomment-258517022 – slhck Apr 24 '19 at 08:59
  • Also `rbindlist(lapply(files, fread))` is possible instead of `rbindlist(lapply(files, function(x) fread(x)))`. – marbel Jul 10 '19 at 19:24
  • 1
    Another great point. I think when I wrote that I was being a bit too careful about protecting data.table activities from mutating the data in place (which affects performance for the next and all subsequent runs across the data). That of course doesn't make sense in this instance. Thank you. :-D Looking forward to running the numbers again soon without the functions and with larger datasets with a bigger machine. – leerssej Jul 10 '19 at 19:37
  • Something you may want to consider including: `tibble(path = dir(pattern = "\\.csv$")) %>% rowwise(path) %>% summarize(read_csv(path))` – this is possible from dplyr 1.0.0 onwards (https://www.tidyverse.org/blog/2020/03/dplyr-1-0-0-summarise/). – slhck Apr 27 '20 at 12:27
  • Suggested code works well with excel files, using **readxl** package, though the column type (if you have type mismatch issues) is specified slightly diffently `read_excel(flnm, col_types = "text")` – Mark Neal Jun 15 '20 at 23:33
  • 1
    In regards to your bonus feature, I found it advantageous to create a dataframe `ls` out of the list of file paths with column `filename` and then to `ls %>% mutate(data = map(filename, read_csv))` This avoids the side-effect of row-binding of adding in columns that are not present in all inputs. – its.me.adam Mar 29 '21 at 17:03
  • 1
    I made an [update](https://stackoverflow.com/a/76362956/1349673) since `map_df` has now been superseded. – James Hirschorn May 30 '23 at 08:31
120

Here are some options to convert the .csv files into one data.frame using R base and some of the available packages for reading files in R.

The base R approach shown first is slower than the data.table and readr options further down.

# Get the files names
files = list.files(pattern="*.csv")
# First apply read.csv, then rbind
myfiles = do.call(rbind, lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE)))

Edit: a few more choices, using data.table and readr.

A fread() version, using the function from the data.table package. This is by far the fastest option in R.

library(data.table)
DT = do.call(rbind, lapply(files, fread))
# The same using `rbindlist`
DT = rbindlist(lapply(files, fread))

Using readr, another package for reading csv files. It's slower than fread and faster than base R, but it has different functionality.

library(readr)
library(dplyr)
tbl = lapply(files, read_csv) %>% bind_rows()
marbel
  • 2
    how does this perform vs. Reduce(rbind, lapply(...))? Just learning R but my guess is less performant – aaron Jul 18 '14 at 16:37
  • 4
    I've added a `data.table` version, that should improve performance. – marbel May 18 '15 at 00:55
  • Is it possible to read only specific files? e.x. Files that contain 'weather' in the name? – derelict Aug 08 '16 at 16:41
  • 1
    found it here: http://stackoverflow.com/questions/10353540/listing-all-files-matching-a-full-path-pattern-in-r thanks. – derelict Aug 08 '16 at 17:24
  • 1
    +1 seems like producing a single data frame -- the SQL UNION of all CSV files -- is the easiest to work with. Since OP didn't specify whether they want 1 data frame or many data frames, I assumed 1 data frame is best, so I am surprised the accepted answer does not do any of the "UNION". I like this answer, which is consistent with [this explanation of `do.call`](http://www.stat.berkeley.edu/~s133/Docall.html) – Nate Anderson Sep 04 '16 at 16:36
  • If you are calling files outside of your working directory, be sure to add `full.names = TRUE` to `list.files()` e.g. `list.files(path = "C:/Users",pattern = ".txt",full.names=TRUE)`. This will attach the full file path to each called file, allowing the `lapply` function to successfully locate and operate on each file. – TheSciGuy May 02 '19 at 15:56
@marbel or anybody, do you know how to adapt any of these (great!) solutions (hopefully the last one using `read_csv`) to a csv file that uses semicolon as separator (I'm working in France...) – lomper Aug 20 '22 at 13:38
39

With many files and many cores, fread xargs cat (described below) is about 50x faster than the fastest solution in the top 3 answers.

rbindlist lapply read.delim  500s <- 1st place & accepted answer
rbindlist lapply fread       250s <- 2nd & 3rd place answers
rbindlist mclapply fread      10s
fread xargs cat                5s

Time to read 121401 csvs into a single data.table. Each time is an average of three runs then rounded. Each csv has 3 columns, one header row, and, on average, 4.510 rows. Machine is a GCP VM with 96 cores.

The top three answers, by @A5C1D2H2I1M1N2O1R2T1, @leerssej, and @marbel, are all essentially the same: apply fread (or read.delim) to each file, then rbind/rbindlist the resulting data.tables. For small datasets, I usually use the rbindlist(lapply(list.files(pattern="*.csv"), fread)) form. For medium-sized datasets, I use parallel's mclapply instead of lapply, which is much faster if you have many cores (a sketch follows).
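A minimal sketch of that mclapply variant, assuming a Unix-alike machine (mclapply relies on forking, which is not available in parallel form on Windows):

library(data.table)
library(parallel)

files <- list.files(pattern = "\\.csv$")
x <- rbindlist(mclapply(files, fread, mc.cores = detectCores()))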

This is better than other R-internal alternatives, but not the best for a large number of small csvs when speed matters. In that case, it can be much faster to first use cat to concatenate all the csvs into one csv, as in @Spacedman's answer. I'll add some detail on how to do this from within R:

x = fread(cmd='cat *.csv', header=F)

However, what if each csv has a header?

x = fread(cmd="awk 'NR==1||FNR!=1' *.csv", header=T)

And what if you have so many files that the *.csv shell glob fails?

x = fread(cmd='find . -name "*.csv" | xargs cat', header=F)

And what if all files have a header AND there are too many files?

header = fread(cmd='find . -name "*.csv" | head -n1 | xargs head -n1', header=T)
x = fread(cmd='find . -name "*.csv" | xargs tail -q -n+2', header=F)
setnames(x,header)

And what if the resulting concatenated csv is too big for system memory? (e.g., /dev/shm out of space error)

system('find . -name "*.csv" | xargs cat > combined.csv')
x = fread('combined.csv', header=F)

With headers?

system('find . -name "*.csv" | head -n1 | xargs head -n1 > combined.csv')
system('find . -name "*.csv" | xargs tail -q -n+2 >> combined.csv')
x = fread('combined.csv', header=T)

Finally, what if you don't want all .csv in a directory, but rather a specific set of files? (Also, they all have headers.) (This is my use case.)

fread(text=paste0(system("xargs cat|awk 'NR==1||$1!=\"<column one name>\"'",input=paths,intern=T),collapse="\n"),header=T,sep="\t")

and this is about the same speed as plain fread xargs cat :)

Note: for data.table pre-v1.11.6 (19 Sep 2018), omit the cmd= from fread(cmd=...).

To sum up, if you're interested in speed, and have many files and many cores, fread xargs cat is about 50x faster than the fastest solution in the top 3 answers.

Update: here is a function I wrote to easily apply the fastest solution. I use it in production in several situations, but you should test it thoroughly with your own data before trusting it.

fread_many = function(files,header=T,...){
  if(length(files)==0) return()
  if(typeof(files)!='character') return()
  files = files[file.exists(files)]
  if(length(files)==0) return()
  tmp = tempfile(fileext = ".csv")
  # note 1: requires awk, not cat or tail because some files have no final newline
  # note 2: parallel --xargs is 40% slower
  # note 3: reading to var is 15% slower and crashes R if the string is too long
  # note 4: shorter paths -> more paths per awk -> fewer awks -> measurably faster
  #         so best cd to the csv dir and use relative paths
  if(header==T){
    system(paste0('head -n1 ',files[1],' > ',tmp))
    system(paste0("xargs awk 'FNR>1' >> ",tmp),input=files)
  } else {
    system(paste0("xargs awk '1' > ",tmp),input=files)
  }
  DT = fread(file=tmp,header=header,...)
  file.remove(tmp)
  DT
}

Update 2: here is a more complicated version of the fread_many function for cases where you want the resulting data.table to include a column for the inpath of each csv. In this case, one must also explicitly specify the csv separator with the sep argument.

fread_many = function(files,header=T,keep_inpath=F,sep="auto",...){
  if(length(files)==0) return()
  if(typeof(files)!='character') return()
  files = files[file.exists(files)]
  if(length(files)==0) return()
  tmp = tempfile(fileext = ".csv")
  if(keep_inpath==T){
    stopifnot(sep!="auto")
    if(header==T){
      system(paste0('/usr/bin/echo -ne inpath"',sep,'" > ',tmp))
      system(paste0('head -n1 ',files[1],' >> ',tmp))
      system(paste0("xargs awk -vsep='",sep,"' 'BEGIN{OFS=sep}{if(FNR>1)print FILENAME,$0}' >> ",tmp),input=files)
    } else {
      system(paste0("xargs awk -vsep='",sep,"' 'BEGIN{OFS=sep}{print FILENAME,$0}' > ",tmp),input=files)
    }
  } else {
    if(header==T){
      system(paste0('head -n1 ',files[1],' > ',tmp))
      system(paste0("xargs awk 'FNR>1' >> ",tmp),input=files)
    } else {
      system(paste0("xargs awk '1' > ",tmp),input=files)
    }
  }
  DT = fread(file=tmp,header=header,sep=sep,...)
  file.remove(tmp)
  DT
}

Caveat: all of my solutions that concatenate the csvs before reading them assume the files all share the same separator. If not all of your csvs use the same delimiter, instead use rbindlist lapply fread, rbindlist mclapply fread, or fread xargs cat in batches, where all csvs in a batch use the same separator.
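A rough sketch of that batching idea, assuming you can tell the comma- and semicolon-separated files apart (here by a hypothetical naming convention) and that both sets share the same columns:

library(data.table)

# hypothetical: files from system A use commas, files from system B use semicolons
comma_files <- list.files(pattern = "^systemA.*\\.csv$")
semi_files  <- list.files(pattern = "^systemB.*\\.csv$")

x <- rbindlist(list(
  rbindlist(lapply(comma_files, fread, sep = ",")),
  rbindlist(lapply(semi_files,  fread, sep = ";"))
), use.names = TRUE)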

webb
  • 2
    This is a great solution and deserves more credit than it's getting. – p-robot May 10 '21 at 09:22
  • 1
Thank you for this thorough overview - this was hugely helpful. – sparco1500 Jul 20 '21 at 08:39
  • this is an awesome solution, I got away with just using the first single line that you shared, and it worked a treat (thank you!) but unfortunately I am ending up with all of the columns being classified as "character" although roughly 90% of them should be numeric. I think that your updates might contain the solution but I am too much of a beginner to apply them properly, can you please shed some light on why fread might be reading everything as chr and what I could try to get around it please? – Gilrob Aug 10 '23 at 05:39
27

As well as using lapply or some other looping construct in R, you could merge your CSV files into one file.

In Unix, if the files had no headers, it's as easy as:

cat *.csv > all.csv

Or, if there are headers and you can find a string that matches headers and only headers (i.e. suppose the header lines all start with "Age"), you'd do:

cat *.csv | grep -v ^Age > all.csv

I think in Windows you could do this with COPY and SEARCH (or FIND or something) from the DOS command box, but why not install cygwin and get the power of the Unix command shell?
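From within R on Windows, a rough equivalent might be (a sketch, assuming headerless files; the output name deliberately avoids the *.csv glob so the combined file isn't appended to itself):

# cmd.exe's copy concatenates every matching file into one
shell("copy /b *.csv combined.txt")
all_data <- read.csv("combined.txt", header = FALSE)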

Spacedman
26

This is the code I developed to read all csv files into R. It will create a dataframe for each csv file individually and title that dataframe with the file's original name (removing spaces and the .csv extension). I hope you find it useful!

path <- "C:/Users/cfees/My Box Files/Fitness/"
files <- list.files(path=path, pattern="*.csv")
for(file in files)
{
perpos <- which(strsplit(file, "")[[1]]==".")
assign(
gsub(" ","",substr(file, 1, perpos-1)), 
read.csv(paste(path,file,sep="")))
}
  • amazing. it worked pretty well (I've adapted to use with excel files). In case someone wants the code: `path = "C:/Users/luisf/ etc here/" files = list.files(path=path, pattern = "*.xlsx") for(file in files) { perpos <- which(strsplit(file, "")[[1]]==".") assign( gsub(" ","",substr(file, 1, perpos-1)), read_xlsx(paste(path,file,sep="")))` – Luis Aug 05 '21 at 14:49
14

In my view, most of the other answers are obsoleted by rio::import_list, which is a succinct one-liner:

library(rio)
my_data <- import_list(dir("path_to_directory", pattern = ".csv"), rbind = TRUE)

Any extra arguments are passed to rio::import. rio can deal with almost any file format R can read, and it uses data.table's fread where possible, so it should be fast too.
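One practical note: if the files are not in your working directory, have dir() return full paths so import_list() can find them (a minor variation on the snippet above):

library(rio)
my_data <- import_list(dir("path_to_directory", pattern = "\\.csv$", full.names = TRUE),
                       rbind = TRUE)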

Kevin Wright
6

Using plyr::ldply, there is roughly a 50% speed increase from enabling the .parallel option while reading 400 csv files of roughly 30-40 MB each. The example includes a text progress bar.

library(plyr)
library(data.table)
library(doSNOW)

csv.list <- list.files(path="t:/data", pattern=".csv$", full.names=TRUE)

cl <- makeCluster(4)
registerDoSNOW(cl)

pb <- txtProgressBar(max=length(csv.list), style=3)
pbu <- function(i) setTxtProgressBar(pb, i)
dt <- setDT(ldply(csv.list, fread, .parallel=TRUE, .paropts=list(.options.snow=list(progress=pbu))))

stopCluster(cl)
manotheshark
  • Nice answer! How do you pass additional arguments to `fread` or `user-defined functions`? Thanks! – Tung Apr 13 '20 at 12:54
  • 1
    @Tung Looking at `?ldply` shows `...` other arguments passed on to `.fun`. Using either `fread, skip = 100` or `function(x) fread(x, skip = 100)` would work – manotheshark Apr 13 '20 at 21:24
  • using `function(x) fread(x, skip = 100)` didn't work for me but providing additional args after bare function name did the trick. Thanks again! – Tung Apr 14 '20 at 16:08
6

Using purrr and including file IDs as a column:

library(tidyverse)


p <- "my/directory"
files <- list.files(p, pattern="csv", full.names=TRUE) %>%
    set_names()
merged <- files %>% map_dfr(read_csv, .id="filename")

Without set_names(), .id= will use integer indicators instead of actual file names.

If you then want just the short filename without the full path:

merged <- merged %>% mutate(filename=basename(filename))
Paul
  • "Error in set_names(.) : 1 argument passed to 'names<-' which requires 2", any idea? – slhck Feb 26 '21 at 11:10
  • Maybe this snippet requires at least 2 files, and you have only 1 file from `list.files()`? Not sure – Paul Mar 02 '21 at 17:35
  • No, I have more than one file there. `set_names` takes two arguments, so I'm not sure how your code even works. You might do `files <- list.files(…)` and then `files <- files %>% set_names(files)` – slhck Mar 02 '21 at 19:36
  • I just double checked, and code still checks out as written. Perhaps purrr or rlang needs to be updated, since that's where set_names() comes from. A single argument is also fine: "If you do nothing, x will be named with itself." per docs: https://rlang.r-lib.org/reference/set_names.html – Paul Mar 03 '21 at 17:08
  • ... also see note on "set_names" at https://purrr.tidyverse.org/reference/map.html – Paul Mar 03 '21 at 17:08
  • 2
    Hmm, my `set_names` function was masked from `magrittr`. Now it works! Thanks for checking back with me. – slhck Mar 03 '21 at 21:09
4

Building on dnlbrk's comment, assign can be considerably faster than list2env for big files.

library(readr)
library(stringr)

List_of_file_paths <- list.files(path ="C:/Users/Anon/Documents/Folder_with_csv_files/", pattern = ".csv", all.files = TRUE, full.names = TRUE)

By setting the full.names argument to true, you will get the full path to each file as a separate character string in your list of files, e.g., List_of_file_paths[1] will be something like "C:/Users/Anon/Documents/Folder_with_csv_files/file1.csv"

for (f in seq_along(List_of_file_paths)) {
  file_name <- str_sub(string = List_of_file_paths[f], start = 46, end = -5)
  file_df <- read_csv(List_of_file_paths[f])
  assign(x = file_name, value = file_df, envir = .GlobalEnv)
}

You could use the data.table package's fread or base R read.csv instead of read_csv. The file_name step allows you to tidy up the name so that each data frame is not left with the full path to the file as its name. You could extend your loop to do further things to the data table before transferring it to the global environment, for example:

for (f in seq_along(List_of_file_paths)) {
  file_name <- str_sub(string = List_of_file_paths[f], start = 46, end = -5)
  file_df <- read_csv(List_of_file_paths[f])
  file_df <- file_df[, 1:3] # if you only need the first three columns
  assign(x = file_name, value = file_df, envir = .GlobalEnv)
}
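For example, the same loop with data.table's fread swapped in (a sketch; the start = 46 offset is tied to the example path above and would need adjusting for your own paths):

library(data.table)

for (f in seq_along(List_of_file_paths)) {
  file_name <- str_sub(string = List_of_file_paths[f], start = 46, end = -5)
  file_df <- fread(List_of_file_paths[f])
  assign(x = file_name, value = file_df, envir = .GlobalEnv)
}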
4

This is my specific example for reading multiple files and combining them into one data frame:

path <- file.path("C:/folder/subfolder")
files <- list.files(path = path, pattern = "*.csv", full.names = TRUE)
data <- do.call(rbind, lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE)))
vuminh91
3

From readr 2.0.0 onwards, you can read multiple files in at once simply by providing a list of their paths to the file argument. Here is an example using readr::read_csv().

packageVersion("readr")
#> [1] '2.0.1'
library(readr)
library(fs)

# create files to read in
write_csv(read_csv("1, 2 \n 3, 4", col_names = c("x", "y")), file = "file1.csv")
write_csv(read_csv("5, 6 \n 7, 8", col_names = c("x", "y")), file = "file2.csv")

# get a list of files
files <- dir_ls(".", glob = "file*csv")
files
#> file1.csv file2.csv

# read them in at once
# record paths in a column called filename
read_csv(files, id = "filename")
#> # A tibble: 4 × 3
#>   filename      x     y
#>   <chr>     <dbl> <dbl>
#> 1 file1.csv     1     2
#> 2 file1.csv     3     4
#> 3 file2.csv     5     6
#> 4 file2.csv     7     8

Created on 2021-09-16 by the reprex package (v2.0.1)

mine
2

The following code should give you the fastest speed for big data, as long as you have many cores on your computer:

if (!require("pacman")) install.packages("pacman")
pacman::p_load(doParallel, data.table, stringr)

# get the file name
dir() %>% str_subset("\\.csv$") -> fn

# use parallel setting
(cl <- detectCores() %>%
  makeCluster()) %>%
  registerDoParallel()

# read and bind all files together
system.time({
  big_df <- foreach(
    i = fn,
    .packages = "data.table"
  ) %dopar%
    {
      fread(i, colClasses = "character")
    } %>%
    rbindlist(fill = TRUE)
})

# end of parallel work
stopCluster(cl)

Update (2020/04/16): I found a new package for parallel computation, so an alternative solution is provided using the following code.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(future.apply, data.table, stringr)

# get the file name
dir() %>% str_subset("\\.csv$") -> fn

plan(multiprocess)

future_lapply(fn,fread,colClasses = "character") %>% 
  rbindlist(fill = TRUE) -> res

# res is the merged data.table
Hope
1

I like the approach using list.files(), lapply() and list2env() (or fs::dir_ls(), purrr::map() and list2env()). That seems simple and flexible.

Alternatively, you may try the small package {tor} (to-R): By default it imports files from the working directory into a list (list_*() variants) or into the global environment (load_*() variants).

For example, here I read all the .csv files from my working directory into a list using tor::list_csv():

library(tor)

dir()
#>  [1] "_pkgdown.yml"     "cran-comments.md" "csv1.csv"        
#>  [4] "csv2.csv"         "datasets"         "DESCRIPTION"     
#>  [7] "docs"             "inst"             "LICENSE.md"      
#> [10] "man"              "NAMESPACE"        "NEWS.md"         
#> [13] "R"                "README.md"        "README.Rmd"      
#> [16] "tests"            "tmp.R"            "tor.Rproj"

list_csv()
#> $csv1
#>   x
#> 1 1
#> 2 2
#> 
#> $csv2
#>   y
#> 1 a
#> 2 b

And now I load those files into my global environment with tor::load_csv():

# The working directory contains .csv files
dir()
#>  [1] "_pkgdown.yml"     "cran-comments.md" "CRAN-RELEASE"    
#>  [4] "csv1.csv"         "csv2.csv"         "datasets"        
#>  [7] "DESCRIPTION"      "docs"             "inst"            
#> [10] "LICENSE.md"       "man"              "NAMESPACE"       
#> [13] "NEWS.md"          "R"                "README.md"       
#> [16] "README.Rmd"       "tests"            "tmp.R"           
#> [19] "tor.Rproj"

load_csv()

# Each file is now available as a dataframe in the global environment
csv1
#>   x
#> 1 1
#> 2 2
csv2
#>   y
#> 1 a
#> 2 b

Should you need to read specific files, you can match their file-path with regexp, ignore.case and invert.
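For example, a small sketch using the arguments named above ("csv1" is just one of the files shown earlier):

# read only the files whose names match the pattern
list_csv(regexp = "csv1")

# or everything except those matches
list_csv(regexp = "csv1", invert = TRUE)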


For even more flexibility use list_any(). It allows you to supply the reader function via the argument .f.

(path_csv <- tor_example("csv"))
#> [1] "C:/Users/LeporeM/Documents/R/R-3.5.2/library/tor/extdata/csv"
dir(path_csv)
#> [1] "file1.csv" "file2.csv"

list_any(path_csv, read.csv)
#> $file1
#>   x
#> 1 1
#> 2 2
#> 
#> $file2
#>   y
#> 1 a
#> 2 b

Pass additional arguments via ... or inside the lambda function.

path_csv %>% 
  list_any(readr::read_csv, skip = 1)
#> Parsed with column specification:
#> cols(
#>   `1` = col_double()
#> )
#> Parsed with column specification:
#> cols(
#>   a = col_character()
#> )
#> $file1
#> # A tibble: 1 x 1
#>     `1`
#>   <dbl>
#> 1     2
#> 
#> $file2
#> # A tibble: 1 x 1
#>   a    
#>   <chr>
#> 1 b

path_csv %>% 
  list_any(~read.csv(., stringsAsFactors = FALSE)) %>% 
  map(as_tibble)
#> $file1
#> # A tibble: 2 x 1
#>       x
#>   <int>
#> 1     1
#> 2     2
#> 
#> $file2
#> # A tibble: 2 x 1
#>   y    
#>   <chr>
#> 1 a    
#> 2 b
Mauro Lepore
1

It was requested that I add this functionality to the stackoverflow R package. Given that it is a tinyverse package (and can't depend on third party packages), here is what I came up with:

#' Bulk import data files 
#' 
#' Read in each file at a path and then unnest them. Defaults to csv format.
#' 
#' @param path        a character vector of full path names
#' @param pattern     an optional \link[=regex]{regular expression}. Only file names which match the regular expression will be returned.
#' @param reader      a function that can read data from a file name.
#' @param ...         optional arguments to pass to the reader function (eg \code{stringsAsFactors}).
#' @param reducer     a function to unnest the individual data files. Use I to retain the nested structure. 
#' @param recursive     logical. Should the listing recurse into directories?
#'  
#' @author Neal Fultz
#' @references \url{https://stackoverflow.com/questions/11433432/how-to-import-multiple-csv-files-at-once}
#' 
#' @importFrom utils read.csv
#' @export
read.directory <- function(path='.', pattern=NULL, reader=read.csv, ..., 
                           reducer=function(dfs) do.call(rbind.data.frame, dfs), recursive=FALSE) {
  files <- list.files(path, pattern, full.names = TRUE, recursive = recursive)

  reducer(lapply(files, reader, ...))
}

By parameterizing the reader and reducer function, people can use data.table or dplyr if they so choose, or just use the base R functions that are fine for smaller data sets.
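For example, usage with the defaults versus a data.table-flavoured call (a sketch; the path is hypothetical):

# base R defaults: read.csv + rbind.data.frame
all_data <- read.directory("path/to/csvs", pattern = "\\.csv$")

# swap in data.table's reader and reducer instead
all_dt <- read.directory("path/to/csvs", pattern = "\\.csv$",
                         reader = data.table::fread,
                         reducer = data.table::rbindlist)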

Neal Fultz
1

An update to @leerssej's "speedy and succinct" tidyverse solution, since map_df has been superseded:

library(purrr) # for map() and list_rbind()
library(readr)

tbl <-
  list.files(pattern = "*.csv") |> 
  map(\(fn) read_csv(fn)) |>
  list_rbind()
James Hirschorn