Reading in multiple excel files, adding a column, then binding

Question

I have a series of Excel files that I want to read into R, add in a date column based on the file name, then bind together.

The naming convention of the files is User_Info_Jan, User_Info_Feb, User_Info_Mar. The month is only referenced in the name of the file and not actually mentioned in the actual file itself. An example of what the User_Info_Jan files looks like:

ID   Name
ABC  Joe Smith
DEF  Henry Cooper 
ZCS  Kelly Ma

Is there a way I can read the files in using the pattern in the file name (pattern = User_Info_), then add a column called "Month" indicating what month the file is for, before binding together?

Sample Data frame after month column:

ID   Name           Month
ABC  Joe Smith      January
DEF  Henry Cooper   January
ZCS  Kelly Ma       January

Sample data frame after binding together:

ID   Name           Usage Month
ABC  Joe Smith      January
DEF  Henry Cooper   January
ZCS  Kelly Ma       January
KFY  Lisa Schwartz  February
LFG  Alex Shah      March

Joel Kandiah · Answer 1 · 2021-04-06T16:04:38.743

0

I would use the map() function to solve this problem from the purrr library.

Without a reproducible format as we are reding in files an example from my recent code is as follows:

# Get all the filenames (I assume this contains the month data in  your case)
GravfilesMap <- list.files("GravityModel/MapOut", full.names = T)

GravMap <-
  # Use Regex to select the string for the month for you I would try "_([a-zA-Z]+).xlsx" passed to the str_match function (this gets the month names as a column)
  (GravfilesMap %>% str_match("(\\d+).csv$"))[,2] %>%
  # Convert to a data frame
  tibble %>% 
  # For each file_name read in the data to its own data frame (this will give on each row a month name and then a nested dataframe)
  # I have used read_csv here you will use something like read_xls
  # The order of the files is the same as the order of our months as we are importing them in the order specified by the list
  mutate(file_contents = map(GravfilesMap, ~read_csv(., col_names = F)))

# Unnest the dataframes to appear in the form that was requested
GravMap <- GravMap %>% unnest()

Detail on a similar method can be found at https://clauswilke.com/blog/2016/06/13/reading-and-combining-many-tidy-data-files-in-r/

edited Apr 06 '21 at 16:04

answered Apr 06 '21 at 15:37

Joel Kandiah

1,465
5
15

It's far better to use `list.files(..., full.names=TRUE)` than to not use that and then `paste0` the original path to it. While the results are usually the same, it does have specific fail-paths that are not always intuitive (e.g., `length(list.files(..,full.names=F))` is 0, but `length(GravfilesMap)` will be 1 here). Another point: `\\d` matches digits, so will not find `"Jan"`, etc. – r2evans Apr 06 '21 at 15:43
1

I have attempted to update the answer accordingly. – Joel Kandiah Apr 06 '21 at 15:47
1

For discussion: I've read and referenced that blog post several times, I wish Claus had not done a couple of things that are many times problematic: (1) use of `require` instead of `library`, which there encourages bad habits (https://stackoverflow.com/a/51263513/3358272); (2) `dir(.)` without `full.names=TRUE` has been the source of many problems here on SO, I think the default value of false makes no sense in defensive programming. Even with that, it's a good blog post, and his 2019 update was well warranted. – r2evans Apr 06 '21 at 15:57
btw, you recommend `map_chr` but have since edited it out (on my recommendation) – r2evans Apr 06 '21 at 16:03
1

Thanks you're quite right if anything my map function should have been `map_dfr` and then simplify the code. – Joel Kandiah Apr 06 '21 at 16:05

r2evans · Accepted Answer · 2021-04-06T17:56:19.807

I'll demonstrate with fake filenames, but the real commands I suggest you run are commented out with the same structure. I'm assuming .xlsx for "excel files", but this works equally well with .csv (just update the pattern).

# files <- list.files(path = ".", pattern = "User_Info_.*\\.xlsx$", full.names = TRUE)
files <- c("./User_Info_Jan.xlsx", "./User_Info_Feb.xlsx", "./User_Info_Mar.xlsx")
monthnames <- strcapture("User_Info_(.*)\\.xlsx", files, list(month = ""))
monthnames
#   month
# 1   Jan
# 2   Feb
# 3   Mar

At this point, we've extracted the month name from each filename. I find strcapture (in base R) better than gsub, as the latter returns the entire string if there are no matches; another alternative in base R is regmatches(files, gregexpr(...)), but that seems a bit more complicated than it needs to be here. Another alternative is stringr::str_extract which might be more intuitive if you're already using stringr and/or other tidyverse packages.

From here, we can iterate over the files to read them in.

# out <- Map(function(mn, fn) transform(readxl::read_excel(fn), month = mn), monthnames$month, files)
set.seed(42)
out <- Map(function(mn, fn) transform(mtcars[sample(32,size=2),], month = mn), monthnames$month, files)
out
# $Jan
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb month
# Chrysler Imperial 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4   Jan
# Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2   Jan
# $Feb
#                   mpg cyl disp  hp drat    wt  qsec vs am gear carb month
# Mazda RX4        21.0   6  160 110 3.90 2.620 16.46  0  1    4    4   Feb
# Pontiac Firebird 19.2   8  400 175 3.08 3.845 17.05  0  0    3    2   Feb
# $Mar
#                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb month
# Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4   Mar
# Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1   Mar

Combining that list-of-frames into a single frame is direct:

do.call(rbind, out)
#                        mpg cyl  disp  hp drat    wt  qsec vs am gear carb month
# Jan.Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4   Jan
# Jan.Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2   Jan
# Feb.Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4   Feb
# Feb.Pontiac Firebird  19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2   Feb
# Mar.Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4   Mar
# Mar.Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1   Mar

An alternative to all of that can use data.table::rbindlist or dplyr::bind_rows, and assign the "id" column directly:

# out <- Map(function(mn, fn) readxl::read_excel(fn), monthnames$month, files)
set.seed(42)
out <- Map(function(mn, fn) mtcars[sample(32,size=2),], monthnames$month, files)

data.table::rbindlist(out, idcol = "month")
#     month   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <char> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:    Jan  14.7     8 440.0   230  3.23 5.345 17.42     0     0     3     4
# 2:    Jan  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
# 3:    Feb  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
# 4:    Feb  19.2     8 400.0   175  3.08 3.845 17.05     0     0     3     2
# 5:    Mar  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# 6:    Mar  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1

dplyr::bind_rows(out, .id = "month")
#                   month  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Chrysler Imperial   Jan 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
# Hornet Sportabout   Jan 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
# Mazda RX4           Feb 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
# Pontiac Firebird    Feb 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
# Merc 280            Mar 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
# Hornet 4 Drive      Mar 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1

The latter two work because when I called Map earlier, the first argument (monthnames$month) passed to the inner function is used as the names for the list output, which is why you see $Jan etc as the elements of the returned list. Both rbindlist and bind_rows use those names as "id" columns when idcol=/.id= are used. (If no "names" are actually present, both functions count along them.)

Thanks for your response! Do you know why I get the error Error: 'read_excel' is not an exported object from 'namespace:readr', when I try to use the map function in your second step? — Alice Wang, Apr 06 '21 at 17:30
My bad ... `readxl::read_excel`. Fixed. BTW, if you are using `csv`, then use `read.csv` or `readr::read_csv` or similar (I wasn't certain which format you're actually using). — r2evans, Apr 06 '21 at 17:56

score 0 · Answer 3 · answered Apr 06 '21 at 16:02

You can try with purrr package like this:

files <- c("./User_Info_Jan.xlsx", "./User_Info_Feb.xlsx", "./User_Info_Mar.xlsx")
months <- c("Jan","Feb","Mar")

library(openxlsx)
library(purrr)
map2_dfr(files,months,function(x,y) read.xlsx(x) %>% mutate(Month=y))

Reading in multiple excel files, adding a column, then binding

3 Answers3

Linked