I understand that readxl
can be used to read in multiple worksheets from a workbook. However, I am struggling to extend this and vectorise this across many workbooks with different sheet names and number of sheets and data therein.
I demonstrate using the Enron spreadsheet data which is a bunch of .xlsx files that I downloaded.
head(list.files("../data/enron_spreadsheets/"), 3)
[1] "albert_meyers__1__1-25act.xlsx"
[2] "albert_meyers__2__1-29act.xlsx"
[3] "andrea_ring__10__ENRONGAS(1200).xlsx"
To make it manageable, we sample.
# Set the path to your directory of Enron spreadsheets here
enron_path <- "../data/enron_spreadsheets/"
# Set the sample size for testing here
sample_size <- 100
all_paths <- list.files(enron_path,
full.names = TRUE)
# For testing, look at n (sample_size) random workbooks.
set.seed(1337)
sample_paths <- sample(all_paths, sample_size)
paths <- sample_paths
Inspecting these workbooks and counting the number of worksheets therein reveals they have different number of sheets and contain different data.
# purr package
# https://jennybc.github.io/purrr-tutorial/index.html
sheet_count <- purrr::map(paths, readxl::excel_sheets) %>%
purrr::map(length) %>%
unlist()
hist(sheet_count, main = "")
However, to load all the sheets in a workbook into a list of data frames, we need to:
- Get worksheet names as a self-named character vector (these names propagate nicely).
Use
purrr::map()
to iterate sheet reading.books <- dplyr::data_frame(filename = basename(paths), path = paths, sheet_name = purrr::map(paths, readxl::excel_sheets) ) %>% dplyr::mutate(id = as.character(row_number())) books # A tibble: 100 x 4 filename <chr> 1 kenneth_lay__19485__Mlp_1109.xlsx 2 kate_symes__18980__SP 15 pages.xls 3 chris_germany__1821__newpower-purc 4 john_griffith__15991__Forwards Det 5 jane_tholt__13278__bid2001A.xlsx 6 gerald_nemec__11481__EOLfieldnames 7 stacey_white__39009__Power RT Serv 8 eric_saibi__9766__012302.xlsx 9 david_delainey__8083__ENA Status o 10 daren_farmer__5035__HPLN0405.xlsx # ... with 90 more rows, and 3 # more variables: path <chr>, # sheet_name <list>, id <chr>
Here we have one row per workbook in books
with the workbook's worksheet names stored in a list column. We want one row per worksheet with the data contents of the worksheet stored in a list column so that we can add extra features based on the worksheets data (the worksheet is the experimental unit). The problem is it doesn't vectorise as expected, am I missing something?
This errors...
sheets <-
tibble::tibble("sheet_name" = unlist(books$sheet_name),
"path" = rep(paths,
times = unlist(
purrr::map_int(books$sheet_name, length))
),
"filename" = basename(path),
"sheet_data" = tibble::lst(
readxl::read_excel(path = path[],
sheet = sheet_name[])
)
) %>%
dplyr::mutate(id = as.character(row_number()))
Error in switch(ext, xls = "xls", xlsx = "xlsx", xlsm = "xlsx", if (nzchar(ext)) { :
EXPR must be a length 1 vector
The code works when not passed a vector for workbook path and sheet name, but obviously the data is not from the correct worksheet in this example below:
sheets <-
tibble::tibble("sheet_name" = unlist(books$sheet_name),
"path" = rep(paths,
times = unlist(
purrr::map_int(books$sheet_name, length))
),
"filename" = basename(path),
"sheet_data" = tibble::lst(
readxl::read_excel(path = path[1],
sheet = sheet_name[1])
)
) %>%
dplyr::mutate(id = as.character(row_number()))
dplyr::glimpse(sheets)
Observations: 313
Variables: 5
$ sheet_name <chr> "MLP's", "DJ SP15", "newpower-p...
$ path <chr> "../data/enron_spreadsheets//ke...
$ filename <chr> "kenneth_lay__19485__Mlp_1109.x...
$ sheet_data <list> [<# A tibble: 57 x 46, ...
$ id <chr> "1", "2", "3", "4", "5", "6", "...
How do I read in the data from many worksheets in many workbooks into a list column in a tibble?
I'm new to reading in messy spreadsheets and using purrr
any help or pointers would be appreciated.