2

I have a large number of data files (>1000) in a single directory. I would like to merge them all in a single data frame in R. They all have the same number and types of columns. So far what I have is:

setwd("directory")
files <- list.files()
for (i in 1:length(files)) assign(files[i], read.csv(files[i]))

This creates data frames for each of the 1000+ files. Is there any way to merge them, without having to type out a list of all 1000+ file names?

Any help would be appreciated!

nicola
  • 24,005
  • 3
  • 35
  • 56
Oksanna88
  • 31
  • 2
  • 5
    I'd like to know who teaches to use `assign` in R to newbies. DO NOT use `assign`. Just try `do.call(rbind,lapply(files,read.csv))`. – nicola Jul 25 '15 at 08:29
  • check [this](http://stackoverflow.com/questions/8091303/simultaneously-merge-multiple-data-frames-in-a-list) and [this](http://stackoverflow.com/questions/31480615/data-table-left-outer-join-on-multiple-tables/31481121#31481121) too – Veerendra Gadekar Jul 25 '15 at 08:29
  • Do you need to 'merge' or are you appending? And if you are merging, what variable are you merging on? – Jason Jul 25 '15 at 10:02
  • @nicola why not use `assign` in this case? – SabDeM Jul 25 '15 at 13:18

2 Answers2

12

The standard way to do this with data.table (recommended because of its speed) is:

library(data.table)
data <- rbindlist(lapply(list.files(), fread))

There are also additional functionalities, e.g.

rbindlist(lapply(list.files(), fread), fill = TRUE)

Will take care of the possibility that some or many of your files have different column names--any non-overlap will be filled with NA in those files lacking that column.


EDIT: as @nicola mentioned, using assign is to be avoided in general unless you really know what you're doing.

See this post for further reference to that end.

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
  • That's a great solution. What if I wanted to bind files adding new columns instead of new lines? Is there an equivalent `cbindlist` ? – rafa.pereira Sep 14 '15 at 15:32
  • @RafaelPereira that's really a different question and i'm sure that it's been asked. in short it won't be as simple as `cbindlist`, but off the top of my head there are two options: 1) use `rbindlist` and then `dcast` the result; 2) use the `on` option and the `:=` operator to sequentially merge the columns. One point to keep in mind is that a merge is essential for adding columns because we should think of the row as a sacrosanct observation--that is, the values associated with each ID are tied specifically to that ID, i.e. order matters, which is a contrast w.r.t. adding rows. – MichaelChirico Sep 14 '15 at 17:51
3

One good way to do that is to utilize data.table. This library has two benefits that will work in your case: a) it has a fast way of reading .csv files, and b) a fast way of combining data.tables (which are an extension of data.frame) into one. So in this spirit, let me propose the following alternative:

# if you don't have data.table installed, run install.packages('data.table') first
library(data.table)
files <- list.files('directory', full.names = TRUE)
#create a list to manage the individual files, only used to merge them in the end
FILES_LIST=vector("list",length(files)) 
for (i in 1:length(files)) {
    FILES_LIST[[i]]<-fread(files[i]) #this reads your .csv file
}
FILES_LIST = rbindlist(FILES_LIST) #this merges all of your files in a big data.table

The variable you are interested in, in the end is FILES_LIST.

I hope this helps.

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
Nikos
  • 3,267
  • 1
  • 25
  • 32