
I'm writing code to:

  • read a list of folders

  • sort and extract folders with certain text elements from the list

  • get full file names from each folder

  • find the tabs/sheets inside each file

  • loop/lapply read.xlsx() over my nested list of files

The ultimate goal is to read all tabs/sheets from their respective files from their respective folders, while creating columns to identify what tab/sheet they're from and what folder they are from, and then merging them all in one go.

My preferred package for reading in Excel files is 'openxlsx'.

Here's my code for getting folders and files:

library(openxlsx)  # read.xlsx(), getSheetNames()
library(dplyr)     # starts_with()

path<- "/Users/jackserna/Google Drive/Folder"
dataFolders<- list.files(path)
dataFolders<- sort(dataFolders[starts_with(match = "FY", vars = dataFolders)])
files<- lapply(lapply(dataFolders, FUN = function(x){
     paste(path,x,sep = "/")
     }), FUN = function(x){
          list.files(x, pattern = "\\.xlsx$", full.names = TRUE)
 })

I have been unable to loop/apply a read function over all my files and all my sheets. I could only read one folder at a time, which meant repeating the whole process for every folder. I used some code from this post...

data.to.merge <- lapply(lapply(files[[1]], FUN = function(x){
     read.xlsx(x, sheet = 3, cols = 1:5)
}), na.omit)
merged.data <- Reduce(function(...) merge(..., all = TRUE), data.to.merge)

However, this approach will not allow me to append sheet names as an extra column for each sheet I read in. This approach assumes that there is data on sheet #3, but to my dismay that is not the case with these files. Data is scattered across sheets, and some sheets must be ignored in order to merge.
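To make the target concrete, this is roughly the shape I'm after for a single file; the sheet names to skip and the file picked out below are just placeholders, not working code I already have:

library(openxlsx)

skip<- c("Notes")                        # hypothetical sheet names to ignore
f<- files[[1]][1]                        # one file from the nested list above
keep<- setdiff(getSheetNames(f), skip)
tabs<- lapply(keep, function(s){
     d<- na.omit(read.xlsx(f, sheet = s, cols = 1:5))
     d$sheet<- s                         # identify the tab/sheet
     d$folder<- basename(dirname(f))     # identify the folder
     d
})
# ...and then do this for every file in every folder and merge it all in one go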

In an attempt to grab all the sheets and parse out the sheets I don't want, here's what was done:

allsheets<- list()
for(i in files){
  for(j in i){
    allsheets[[j]]<- getSheetNames(j)  # one vector of sheet names per file, keyed by file path
  }
}

But this has turned into a nightmare to be able to use for reading and merging.

How can I get R to understand what I'm trying to accomplish?

Jack Serna

3 Answers


I haven't been able to solve this particular data-cleaning nightmare, but in preparing for the next year I have made some changes to the file structure. I have found the data.table package to be really useful for collapsing nested lists into something like a data frame.

Notice, below, that all data is read from only 1 specified sheet name (or number).

library(openxlsx)    # read.xlsx()
library(data.table)  # rbindlist()
library(dplyr)       # starts_with()

### Get list of folder names
dataFolders<- list.files(path = path)
dataFolders<- sort(dataFolders[starts_with(match = "FY", vars = dataFolders)])

### Get list of files for each regional folder
files<- lapply(lapply(dataFolders, FUN = function(x){
     paste(path,x,sep = "/")}), FUN = function(x){
          list.files(x, pattern = "\\.xlsx$", full.names = TRUE)
})

### Read the one named sheet from every file, drop NA rows, then stack everything
dataPrep<- lapply(files, FUN = function(x){
     lapply(lapply(x, FUN = function(x){
          read.xlsx(x, sheet = "Sheet Name", cols = 1:6)}), na.omit)
})
transform<- rbindlist(lapply(dataPrep, FUN = function(x){
     rbindlist(x, use.names = TRUE)
}), use.names = TRUE)

This isn't a fix for the original nightmare, but it should be the standard for data entry going forward.
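If the old files ever do need to be consolidated, here is a hedged sketch of the same rbindlist() pattern that also tags the folder and sheet name; the `skip` vector is a placeholder for sheets to ignore, and I have not run this against the original files:

names(files)<- dataFolders           # name the outer list so idcol can pick the folder up
skip<- c("Notes", "Summary")         # hypothetical sheet names to leave out

perFolder<- lapply(files, FUN = function(fls){
     rbindlist(lapply(fls, FUN = function(f){
          sheets<- setdiff(getSheetNames(f), skip)
          rbindlist(setNames(lapply(sheets, function(s){
               na.omit(read.xlsx(f, sheet = s, cols = 1:6))
          }), sheets), use.names = TRUE, fill = TRUE, idcol = "sheet")
     }), use.names = TRUE, fill = TRUE)
})
allData<- rbindlist(perFolder, use.names = TRUE, fill = TRUE, idcol = "folder")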

Jack Serna

I would keep your separate functions discrete so you can chase down where things go wrong, and also to give R a chance to clean up after your Excel package, which WILL be memory intensive. If you push everything into one command it will bog down and crash.

Once you read the sheet_list into memory using your favorite package, this should do it:

for (i in sheet_list){
     new_sheet<- your_read_command(i)  # substitute the read command from whatever reader package you use, like XLConnect or xlsx
     new_sheet$sheetname<- i           # add the column with the sheet name prior to binding
     data<- rbind(data, new_sheet)
     rm(new_sheet)
     gc()
}

The rm() & gc() are MASSIVELY important here. You will find that the Java-based Excel readers (XLConnect, xlsx) HOG memory and can crash your system. Remove each object after its data is bound and then garbage collect to keep from crawling to a halt or flat-out crashing!

I tested the parts with a CSV and it works... you just have to swap in an Excel reader package and possibly tweak a bit. If you need help, post a comment...

OH and... create `data <- data.frame()` and assign it all your column names before you go into the loop... this will give the loop something to bind onto!
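For example (the column names here are hypothetical; use the ones that match your sheets, plus the sheet-name column):

# hypothetical column names -- the last one is the sheet-name column added in the loop
data<- setNames(data.frame(matrix(ncol = 3, nrow = 0)),
                c("id", "value", "sheetname"))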

sconfluentus
  • I'm trying out this loop: `data<- setNames(data.frame(matrix(ncol = 207, nrow = 0)), column_names) wksheets <- list.files(recursive=T, pattern='*.xlsx') for (i in wksheets){ new_file<- loadWorkbook(wksheets[i]) sheetnames<- getSheets(new_file) sheetnames<- sheetnames[3:length(sheetnames)] for (j in sheetnames){ sheet_list<- readWorkbook(sheetnames[j], startRow = 2, colNames = FALSE) sheet_list$Reading.Center<- j} data<- rbind(data, sheet_list) rm(sheet_list) rm(sheetnames) gc() } ` But I'm receiving a vague error `Error: NullPointerException (Java):` – Jack Serna Feb 23 '17 at 22:14
  • Try following this solution: http://stackoverflow.com/questions/15825004/xlconnect-loadworkbook-error-poixmlexception-java – sconfluentus Feb 24 '17 at 02:01

I can additionally recommend the package readxl. It probably can't detect the number of sheets inside a workbook on its own, but you can use it after finding the sheets with XLConnect.

library(data.table)
library(readxl)
my.monthly.excel.files <- "../../../../../../Documents/Output/Monthly/"
my.file.list <- data.table(File.Name = list.files(path = my.monthly.excel.files, pattern = "\\.xlsx$", recursive = FALSE, include.dirs = FALSE))
rbindlist(lapply(my.file.list[,File.Name], function(file){
  print(file)
  read_excel(path = paste0(my.monthly.excel.files, file), sheet = "sheet_name")
}))

In order to get rid of the warnings I defined the column type for each column to be imported as either being text, numeric or date.
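A minimal sketch of the XLConnect-plus-readxl combination mentioned above, with a hypothetical file name; the sheets are listed with XLConnect, read with readxl, and the sheet name is kept via rbindlist's idcol:

library(XLConnect)
library(readxl)
library(data.table)

xlsx.file <- paste0(my.monthly.excel.files, "monthly.xlsx")   # hypothetical file name
sheet.names <- getSheets(loadWorkbook(xlsx.file))             # list the sheets with XLConnect
monthly <- rbindlist(lapply(setNames(sheet.names, sheet.names), function(s){
  read_excel(path = xlsx.file, sheet = s)                     # read each sheet with readxl
}), use.names = TRUE, fill = TRUE, idcol = "sheet")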

hannes101
  • I've tried openxlsx, XLconnect, and readxl, but with frustration in deprecated functions. XLconnect seems to be working more consistently. For everybody's information, I'm using R version 3.3.2, openxlsx version 4.0.0, XLconnect version 0.2-12, and readxl version 0.1.1. Please don't use deprecated functions like data.table or rbindlist unless you specify which version you're running. I thought you meant tbl_df(), but that doesn't make sense, although with your syntax "my.file.list[,File.Name]" it does make sense. As for the output I'm receiving: `Error: corrupt data frame` – Jack Serna Feb 23 '17 at 21:49
  • Hmm, the two commands `data.table` and `rbindlist` are part of the data.table package. Added the call for that now, since they are not deprecated. – hannes101 Feb 24 '17 at 06:30