1

I am trying to batch read in Excel files into R, but I need R to skip and continue the loop if the sheet name in question is missing.

library(xlsx)    
file.list <- list.files(recursive=T,pattern='*.xlsx')  #get files list from folder
for(i in 1:length(file.list))
{
  filename=file.list[i]
  data=read.xlsx(file = filename, sheetName="Income Statement", startColumn=2, startRow=4, header = T)
  assign(x = filename,value = data)
}

I assume I need to add an if else statement within this but I am not sure. Any advice appreciated!

StatsGirl
  • 13
  • 2

2 Answers2

0

Here is an option using an if():

library(xlsx)

file.list <- list.files(pattern = '*.xlsx', recursive = TRUE)

for(i in 1:length(file.list)) {
  filename <- file.list[i]
  sheets <- getSheets(loadWorkbook(filename))
  if("Income Statement" %in% names(sheets)) {
    data <- read.xlsx(file = filename, 
                      sheetName = "Income Statement", 
                      startColumn = 2, 
                      startRow = 4, 
                      header = TRUE)
    assign(x = filename, value = data)
  }
}

Created on 2021-03-03 by the reprex package (v1.0.0)

the-mad-statter
  • 5,650
  • 1
  • 10
  • 20
0

The accepted answer provides a sufficient solution.

However, if the number of files that are required to read is larger, it might be worth to avoid for loops altogether and rewrite the solution using lapply.

Few additional tweaks could be added:

  • You could use data.table rbindlist to bind each file into a single big data.frame / data.table (to avoid multiple data.frames in global environment). This assumes that all loaded data share the structure - this is more concise. This can be also achieved in base with do.call(rbind, "yourlist") but you will loose the "id" column.
  • You can implement some sort of percentage counter to show how many files were loaded using cat(round(i / length(test) * 100, 3), "% \r") which is also noted here.
  • There's number of packages that can be used to do this task. I would recommend openxlsx and avoid the xlsx. The openxlsx removes the dependency on Java. There's also readxl which is worth mentioning. The solution below is using openxlsx

The following solution should achieve similar results but should be faster than a for loop. As no data example was provided, you will have to play with some of the arguments in openxlsx::read.xlsx to specify the range of the sheet you want to load.

test <- list.files(recursive = TRUE, pattern = "*.xlsx")

as.data.frame(data.table::rbindlist(lapply(1:length(test), function(i){
        cat(round(i / length(test) * 100, 3), "%    \r")
        tmp_name <- test[i]
        tmp_sheets <- openxlsx::getSheetNames(tmp_name)
        if ("Income Statement" %in% tmp_sheets) {
                tmp_data <- openxlsx::read.xlsx(
                        tmp_name, 
                        sheet = "Income Statement", 
                        startRow = 4)
                assign(x = tmp_name, value = tmp_data)
        }
        
}
), idcol = "id"))
gofraidh
  • 673
  • 8
  • 26