Read all worksheets (as dataframes) from multiple Excel workbooks of different structure

Question

I understand that readxl can be used to read in multiple worksheets from a workbook. However, I am struggling to extend this and vectorise this across many workbooks with different sheet names and number of sheets and data therein.

I demonstrate using the Enron spreadsheet data which is a bunch of .xlsx files that I downloaded.

head(list.files("../data/enron_spreadsheets/"), 3)

[1] "albert_meyers__1__1-25act.xlsx"                           
[2] "albert_meyers__2__1-29act.xlsx"                           
[3] "andrea_ring__10__ENRONGAS(1200).xlsx"

To make it manageable, we sample.

# Set the path to your directory of Enron spreadsheets here
enron_path <- "../data/enron_spreadsheets/"
# Set the sample size for testing here
sample_size <- 100
all_paths <- list.files(enron_path,
                    full.names = TRUE)

# For testing, look at n (sample_size) random workbooks.
set.seed(1337)
sample_paths <- sample(all_paths, sample_size)

paths <- sample_paths

Inspecting these workbooks and counting the number of worksheets therein reveals they have different number of sheets and contain different data.

# purr package
# https://jennybc.github.io/purrr-tutorial/index.html
sheet_count <- purrr::map(paths, readxl::excel_sheets) %>%
  purrr::map(length) %>%
  unlist()

hist(sheet_count, main = "")

However, to load all the sheets in a workbook into a list of data frames, we need to:

Get worksheet names as a self-named character vector (these names propagate nicely).

Use purrr::map() to iterate sheet reading.

books <-
  dplyr::data_frame(filename = basename(paths),
             path = paths,
             sheet_name = purrr::map(paths, readxl::excel_sheets)
             ) %>%  
  dplyr::mutate(id = as.character(row_number()))

  books

# A tibble: 100 x 4
                             filename
                                <chr>
 1  kenneth_lay__19485__Mlp_1109.xlsx
 2 kate_symes__18980__SP 15 pages.xls
 3 chris_germany__1821__newpower-purc
 4 john_griffith__15991__Forwards Det
 5   jane_tholt__13278__bid2001A.xlsx
 6 gerald_nemec__11481__EOLfieldnames
 7 stacey_white__39009__Power RT Serv
 8      eric_saibi__9766__012302.xlsx
 9 david_delainey__8083__ENA Status o
10  daren_farmer__5035__HPLN0405.xlsx
# ... with 90 more rows, and 3
#   more variables: path <chr>,
#   sheet_name <list>, id <chr>

Here we have one row per workbook in books with the workbook's worksheet names stored in a list column. We want one row per worksheet with the data contents of the worksheet stored in a list column so that we can add extra features based on the worksheets data (the worksheet is the experimental unit). The problem is it doesn't vectorise as expected, am I missing something?

This errors...

sheets <-
  tibble::tibble("sheet_name" = unlist(books$sheet_name),
                 "path" = rep(paths,
                              times = unlist(
                                purrr::map_int(books$sheet_name, length))
                              ),
                 "filename" = basename(path),
                 "sheet_data" = tibble::lst(
                   readxl::read_excel(path = path[], 
                                      sheet = sheet_name[])
                   )
             ) %>% 
  dplyr::mutate(id = as.character(row_number()))

Error in switch(ext, xls = "xls", xlsx = "xlsx", xlsm = "xlsx", if (nzchar(ext)) { : 
  EXPR must be a length 1 vector

The code works when not passed a vector for workbook path and sheet name, but obviously the data is not from the correct worksheet in this example below:

sheets <-
  tibble::tibble("sheet_name" = unlist(books$sheet_name),
                 "path" = rep(paths,
                              times = unlist(
                                purrr::map_int(books$sheet_name, length))
                              ),
                 "filename" = basename(path),
                 "sheet_data" = tibble::lst(
                   readxl::read_excel(path = path[1], 
                                      sheet = sheet_name[1])
                   )
             ) %>% 
  dplyr::mutate(id = as.character(row_number()))

dplyr::glimpse(sheets)

Observations: 313
Variables: 5
$ sheet_name <chr> "MLP's", "DJ SP15", "newpower-p...
$ path       <chr> "../data/enron_spreadsheets//ke...
$ filename   <chr> "kenneth_lay__19485__Mlp_1109.x...
$ sheet_data <list> [<# A tibble: 57 x 46,        ...
$ id         <chr> "1", "2", "3", "4", "5", "6", "...

How do I read in the data from many worksheets in many workbooks into a list column in a tibble?

I'm new to reading in messy spreadsheets and using purrr any help or pointers would be appreciated.

score 4 · Accepted Answer · answered Nov 27 '17 at 14:11

Since you mention the purrr package, some other tidyverse packages are worth considering.

dplyr for mutate(), when applying purrr::map() to a column of a data frame and storing the result as list-column.
tidyr for unnest(), which expands a list-column so that each row inside a list-column becomes a row in the overall data frame.
tibble for nicely printed nested data frames

Sample files are needed to demonstrate. This code uses the openxlsx package to create one file containing two sheets (the built-in iris and mtcars datasets), and another file containing three sheets (adding the built-in attitude dataset).

library(openxlsx)

# Create two spreadsheet files, with different numbers of worksheets
write.xlsx(list(iris, mtcars, attitude), "three_sheets.xlsx")
write.xlsx(list(iris, mtcars),           "two_sheets.xlsx")

Now a solution.

First, list the filenames, which will passed to readxl::excel_sheets() for the names of the sheets within each file, and readxl::read_excel() to import the data itself.

(paths <- list.files(pattern = "*.xlsx"))
#> [1] "three_sheets.xlsx" "two_sheets.xlsx"

(x <- tibble::data_frame(path = paths))
#> # A tibble: 2 x 1
#>   path             
#>   <chr>            
#> 1 three_sheets.xlsx
#> 2 two_sheets.xlsx

'Map' the readxl::excel_sheets() function over each of the file paths, and store the results in a new list-column. Each row of the sheet_name column is a vector of sheet names. As expected, the first one has three sheet names, while the second has two.

(x <- dplyr::mutate(x, sheet_name = purrr::map(path, readxl::excel_sheets)))
#> # A tibble: 2 x 2
#>   path              sheet_name
#>   <chr>             <list>    
#> 1 three_sheets.xlsx <chr [3]> 
#> 2 two_sheets.xlsx   <chr [2]>

We need to pass each filename and each sheet name into readxl::read_excel(path=, sheet=), so the next step is to have a data frame where each row gives a path and one sheet name. This is done using tidyr::unnest().

(x <- tidyr::unnest(x))
#> # A tibble: 5 x 2
#>   path              sheet_name
#>   <chr>             <chr>     
#> 1 three_sheets.xlsx Sheet 1   
#> 2 three_sheets.xlsx Sheet 2   
#> 3 three_sheets.xlsx Sheet 3   
#> 4 two_sheets.xlsx   Sheet 1   
#> 5 two_sheets.xlsx   Sheet 2

Now each path and sheet name can be passed into readxl::read_excel(), using purrr::map2() rather than purrr::map() because we pass two arguments rather than one.

(x <- dplyr::mutate(x, data = purrr::map2(path, sheet_name,
                                          ~ readxl::read_excel(.x, .y))))
#> # A tibble: 5 x 3
#>   path              sheet_name data              
#>   <chr>             <chr>      <list>            
#> 1 three_sheets.xlsx Sheet 1    <tibble [150 × 5]>
#> 2 three_sheets.xlsx Sheet 2    <tibble [32 × 11]>
#> 3 three_sheets.xlsx Sheet 3    <tibble [30 × 7]> 
#> 4 two_sheets.xlsx   Sheet 1    <tibble [150 × 5]>
#> 5 two_sheets.xlsx   Sheet 2    <tibble [32 × 11]>

Now each dataset is in a separate row of the data column. We can look at just one of the datasets by subsetting that column.

x$data[3]
#> [[1]]
#> # A tibble: 30 x 7
#>    rating complaints privileges learning raises critical advance
#>     <dbl>      <dbl>      <dbl>    <dbl>  <dbl>    <dbl>   <dbl>
#>  1   43.0       51.0       30.0     39.0   61.0     92.0    45.0
#>  2   63.0       64.0       51.0     54.0   63.0     73.0    47.0
#>  3   71.0       70.0       68.0     69.0   76.0     86.0    48.0
#>  4   61.0       63.0       45.0     47.0   54.0     84.0    35.0
#>  5   81.0       78.0       56.0     66.0   71.0     83.0    47.0
#>  6   43.0       55.0       49.0     44.0   54.0     49.0    34.0
#>  7   58.0       67.0       42.0     56.0   66.0     68.0    35.0
#>  8   71.0       75.0       50.0     55.0   70.0     66.0    41.0
#>  9   72.0       82.0       72.0     67.0   71.0     83.0    31.0
#> 10   67.0       61.0       45.0     47.0   62.0     80.0    41.0
#> # ... with 20 more rows

This method works for your example @nacnudus but due to a putative [bug](https://github.com/tidyverse/readxl/issues/408) in readxl, it does not work for reading in all the Enron workbooks' worksheets. This may be due to some worksheets being empty / containing no data. — mammykins, Nov 28 '17 at 11:21
The real bug is probably in some xls-to-xlsx conversion tool, so we'll probably never know. The issue can be tracked here https://github.com/tidyverse/readxl/issues/408. — nacnudus, Nov 28 '17 at 12:14

score 0 · Answer 2 · answered Dec 15 '17 at 15:11

I just tested this an it worked fine for one workbook.

library(readxl)    
read_excel_allsheets <- function(filename) {
    sheets <- readxl::excel_sheets(filename)
    x <-    lapply(sheets, function(X) readxl::read_excel(filename, sheet = X))
    names(x) <- sheets
    x
}

This could be called with:

mysheets <- read_excel_allsheets("foo.xls")

Note, it is for xls and xlsx; it will not work for xlsb files.

Read all worksheets (as dataframes) from multiple Excel workbooks of different structure

2 Answers2