51

I have hundreds of medium sized Excel files (between 5000 and 50.0000 rows with about 100 columns) to load into R. They have a well-defined naming pattern, like x_1.xlsx, x_2.xlsx, etc.

How can I load these files into R in the fastest, most straightforward way?

Jaap
  • 81,064
  • 34
  • 182
  • 193
Manuel R
  • 3,976
  • 4
  • 28
  • 41

1 Answers1

112

With list.files you can create a list of all the filenames in your workingdirectory. Next you can use lapply to loop over that list and read each file with the read_excel function from the readxl package:

library(readxl)
file.list <- list.files(pattern='*.xlsx')
df.list <- lapply(file.list, read_excel)

This method can off course also be used with other file reading functions like read.csv or read.table. Just replace read_excel with the appropriate file reading function and make sure you use the correct pattern in list.files.

If you also want to include the files in subdirectories, use:

file.list <- list.files(pattern='*.xlsx', recursive = TRUE)

Other possible packages for reading Excel-files: openxlsx & xlsx


Supposing the columns are the same for each file, you can bind them together in one dataframe with bind_rows from :

library(dplyr)
df <- bind_rows(df.list, .id = "id")

or with rbindlist from :

library(data.table)
df <- rbindlist(df.list, idcol = "id")

Both have the option to add a id column for identifying the separate datasets.


Update: If you don't want a numeric identifier, just use sapply with simplify = FALSE to read the files in file.list:

df.list <- sapply(file.list, read.csv, simplify=FALSE)

When using bind_rows from or rbindlist from , the id column now contains the filenames.

Even another approach is using the purrr-package:

library(purrr)
file.list <- list.files(pattern='*.csv')
file.list <- setNames(file.list, file.list) # only needed when you need an id-column with the file-names

df <- map_df(file.list, read.csv, .id = "id")

Other approaches to getting a named list: If you don't want just a numeric identifier, than you can assign the filenames to the dataframes in the list before you bind them together. There are several ways to do this:

# with the 'attr' function from base R
attr(df.list, "names") <- file.list
# with the 'names' function from base R
names(df.list) <- file.list
# with the 'setattr' function from the 'data.table' package
setattr(df.list, "names", file.list)

Now you can bind the list of dataframes together in one dataframe with rbindlist from data.table or bind_rows from dplyr. The id column will now contain the filenames instead of a numeric indentifier.

Jaap
  • 81,064
  • 34
  • 182
  • 193
  • package with two functions that reads one type of file, hadley sure has outdone himself this time. – rawr Oct 01 '15 at 13:37
  • @rawr indeed, he could have integrated it in his `readr` package for example – Jaap Oct 01 '15 at 13:38
  • Working in R can be so elegant... This sure is a clean and simple way of doing it. However, unfortunately I am struggeling with this error: [link] (https://github.com/hadley/readxl/issues/80) so I probably have to use the (much) slower `read.xlsx` function ...which is a bit of a pain. Having said that: is there a way of converting xlsx to .csv without opening and saving each file one by one? – Manuel R Oct 01 '15 at 14:01
  • 1
    `read.xlsx` might be slower, but this is a process you could probably run in parallel, so long as all of the cores are on the local machine (or have access to the same shared directory, perhaps). That might at least make up for some of the lost time. Caveat: I haven't tried reading files in parallel before, but don't see a reason why it shouldn't work--again, so long as all the cores are on the same machine. – Benjamin Oct 01 '15 at 14:21
  • @ManuelS As an alternative you can use the `read.xlsx` function from the [`openxlsx`-package](https://cran.r-project.org/web/packages/openxlsx/index.html) instead of the `read.xlsx` function from the `xlsx`-package. – Jaap Oct 01 '15 at 14:37
  • @Jaap Thanks. I have settled for your solution using `read.xlsx` from the `openxlsx` package as long as the issue related to `read_excel` prevails. – Manuel R Oct 02 '15 at 06:16
  • What if all the files are in a different directory than the '.R' file? – user177196 Jun 20 '22 at 23:17
  • 1
    @user177196 The `list.files`-function has a `path` parameter with which you can specify the directory in which has to be searched for files; see also `?list.files`. – Jaap Jun 21 '22 at 12:47
  • @Jaap Many thanks. I will look more into that `path` parameter. – user177196 Jun 21 '22 at 14:51