0

I asked this question before but it didn't really go anywhere. I've done more work on it but again I am stuck!

I have a spreadsheet with two tabs, one has 3 cells I am interested in (A2, A4, A6) which are identifying details, and the second tab has a 4X4 grid (A1: D4) which has some financial information.

I can make a data frame, and I can locate the data, and to a certain extent I can extract the data. My issue is looping the whole thing over all the files in the folder, and taking the data and applying it to the pre-created data frame.

Code below for your reference

Locating files

  list.files(
    path = "C:/Excel Files",
    pattern = '*.xlsx|*.XLSX',
    full.names = FALSE,
    recursive = FALSE
  ) 

Creating df

    colnames <- c( A2, A4, A6, A1, B1, C1, D1, A2, B2, C2, D2, A3, B3, C3, D3, A4, B4, C4, D4)

    output <- matrix(NA,nrow = length(file.list), ncol = length(colnames), byrow = FALSE)
    colnames(output) <- c(colnames)
    rownames(output) <- c(file.list)

extracting data

    FirmData1 <- readxl::read_xlsx("N:/Excel Files/test.xlsx", sheet = 2, range = "A1:D1", na = "", col_names = FALSE, col_types = "text")
    FirmData2 <- readxl::read_xlsx("N:/Excel Files/test.xlsx", sheet = 2, range = "A2:D2", na = "", col_names = FALSE, col_types = "text")
    FirmData3 <- readxl::read_xlsx("N:/Excel Files/test.xlsx", sheet = 2, range = "A3:D3", na = "", col_names = FALSE, col_types = "text")
    FirmData4 <- readxl::read_xlsx("N:/Excel Files/test.xlsx", sheet = 2, range = "A4:D4", na = "", col_names = FALSE, col_types = "text")

    FirmData <-  dplyr:: bind_rows(FirmData1, FirmData2, FirmData3, FirmData4)
    FirmData <- t(FirmData)
    colnames(output)

    Firm <- dplyr:: bind_rows(FirmInfo, FirmData) %>%
      tidyr:: spread(key = Field, value = Value)

loop

there is no loop!

M--
  • 25,431
  • 8
  • 61
  • 93
  • Can you clarify what you are looking for. Do you just want the ability to loop over a vector of strings (e.g. c("A1:D1", "A2:D2", "A3:D4", ...) and have a dataframe created as an output? – sdhaus Feb 06 '20 at 16:04
  • Your use of `colnames <- c(A2, A4, ...)` suggests either (1) you have variables named `A1`, etc, in your environment, in which case this is not reproducible; or (2) your code is not real, so this is (again) not reproducible. Please ensure the code you give us is syntactically-correct, valid code. – r2evans Feb 06 '20 at 16:04
  • You call `read_xlsx` four times with ranges that are each a row but could be condensed into `"A1:D4"`. Is there a reason you need the inefficiency of opening it four times vice once? – r2evans Feb 06 '20 at 16:08

1 Answers1

0

Here's a way to loop them all together.

I'll first create a spreadsheet to work on. I'm using openxlsx, but this only required to create the file, not read it in (I'll still use readxl for that).

wb <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb, "FirstSheet")
openxlsx::writeDataTable(wb, "FirstSheet", data.frame(t(outer(c("A","B"), 1:6, paste0))), colNames = FALSE)
openxlsx::addWorksheet(wb, "SecondSheet")
openxlsx::writeDataTable(wb, "SecondSheet", mtcars[1:4, 1:4], colNames = FALSE)
openxlsx::saveWorkbook(wb, "quux.xlsx")

readxl::read_xlsx("quux.xlsx", "FirstSheet", range = c("A2:A6"), col_names = "A")
# # A tibble: 5 x 1
#   A    
#   <chr>
# 1 A2   
# 2 A3   
# 3 A4   
# 4 A5   
# 5 A6   
readxl::read_xlsx("quux.xlsx", "SecondSheet", range = c("A1:D4"), col_names = LETTERS[1:4])
# # A tibble: 4 x 4
#       A     B     C     D
#   <dbl> <dbl> <dbl> <dbl>
# 1  21       6   160   110
# 2  21       6   160   110
# 3  22.8     4   108    93
# 4  21.4     6   258   110

First, show what we want to do for each file:

fn <- "quux.xlsx"
first  <- readxl::read_xlsx(fn,  "FirstSheet", range = "A2:A6", col_names = "A")
second <- readxl::read_xlsx(fn, "SecondSheet", range = "A1:D4", col_names = LETTERS[1:4])
data.frame(matrix(first$A[c(1,3,5)], nrow = 1), stringsAsFactors = FALSE)
#   X1 X2 X3
# 1 A2 A4 A6
data.frame(matrix(t(second), nrow = 1))
#   X1 X2  X3  X4 X5 X6  X7  X8   X9 X10 X11 X12  X13 X14 X15 X16
# 1 21  6 160 110 21  6 160 110 22.8   4 108  93 21.4   6 258 110

Granted, the names are boring, but that's just an aesthetic that can be remedied with colnames.

Now, let's lapply all of that, and combine the results into one frame.

filelist <- c("quux.xlsx", "quux.xlsx", "quux.xlsx")
datlist <- lapply(filelist, function(fn) {
  first  <- readxl::read_xlsx(fn,  "FirstSheet", range = "A2:A6", col_names = "A")
  second <- readxl::read_xlsx(fn, "SecondSheet", range = "A1:D4", col_names = LETTERS[1:4])
  cbind(
    data.frame(matrix(first$A[c(1,3,5)], nrow = 1), stringsAsFactors = FALSE),
    data.frame(matrix(t(second), nrow = 1))    
  )
})
out <- do.call(rbind, datlist)
out
#   X1 X2 X3 X1 X2  X3  X4 X5 X6  X7  X8   X9 X10 X11 X12  X13 X14 X15 X16
# 1 A2 A4 A6 21  6 160 110 21  6 160 110 22.8   4 108  93 21.4   6 258 110
# 2 A2 A4 A6 21  6 160 110 21  6 160 110 22.8   4 108  93 21.4   6 258 110
# 3 A2 A4 A6 21  6 160 110 21  6 160 110 22.8   4 108  93 21.4   6 258 110

Side note:

Your use list.files is to me a little odd, perhaps you have reason. I tend to always use full.names=TRUE, because I need it to be agnostic to my working directory. Especially you set the path to what could easily be not the working directory, to have to then paste the directory back with the filename when reading the files in seems unnecessary. Also, though minor, your pattern is probably good, but if somebody ever creates a wonky file named quux.XlSx (mixed case), you won't see it; it is more permissive to use ignore.case=TRUE.

I suggest

filelist <- list.files(
  path = "C:/Excel Files",
  pattern = '*.xlsx',
  ignore.case = TRUE,
  full.names = TRUE,
  recursive = FALSE
) 
r2evans
  • 141,215
  • 6
  • 77
  • 149