
I am trying to import close to 30 xlsx files, each of them 60 MB+ in size, using the code below:

library(readxl)

setwd("filepath")
filelist <- list.files(pattern = "\\.xlsx$")   # match files ending in .xlsx
datalist <- lapply(filelist, function(x) read_excel(x, sheet = 1))
# append the source file name as an extra column
for (i in seq_along(datalist)) {
  datalist[[i]] <- cbind(datalist[[i]], filelist[i])
}
Data <- do.call("rbind", datalist)

It takes ages to load these files. Is there an efficient way to load them within a few minutes? FYI, my system RAM is 8 GB. Any help would be highly appreciated.

  • Do you need to keep all the records? – Sal-laS Sep 10 '18 at 16:21
  • This might be useful https://stackoverflow.com/a/48105838/786542 – Tung Sep 10 '18 at 16:34
  • If it takes a long time to load the files, try switching from read_excel to `openxlsx::read.xlsx`. Also, what is the purpose of the for loop? – Mike Sep 10 '18 at 17:17
  • In `read_excel` you need to set `col_types`; this will speed things up significantly. But if your Excel data is not tidy, you're probably stuck with longer loading times. – Wimpel Sep 10 '18 at 17:41
  • @Mike read.xlsx is giving me the error "java.lang.OutOfMemoryError: GC overhead limit exceeded". I am trying to include the file name as a separate column through the for loop; it works fine for small files. – Saibal Dutta Sep 11 '18 at 04:40
  • If you restart your session and then run `options(java.parameters = "-Xmx4g")` first, it should get rid of that error. – Mike Sep 11 '18 at 13:20
  • @Mike I tried running `options(java.parameters = "-Xmx2048m")` before my code, and then code a bit like this: `filelist <- list.files(path = '.', recursive = TRUE, pattern = "\\.xlsx", full.names = TRUE); Data <- rbindlist(sapply(filelist, function(x) read.xlsx(x, sheet = 1, colNames = TRUE), simplify = FALSE), use.names = TRUE, idcol = "FileName")` … this loads 4 files in 9 minutes. Can this be made faster? – Saibal Dutta Sep 11 '18 at 13:54
  • Like @Wimpel said, if your data is not tidy and/or you have very large datasets, you might just be stuck with long loading times. Hopefully you are able to clean your data and save a copy so that it will load faster in the future. – Mike Sep 11 '18 at 14:13
  • @SaibalDutta I do not see you setting (and using) `col_types` in your code; this speeds things up for me (big time!!). If not sure, set all to "text" and do your type-conversions in R (see the sketch after these comments). – Wimpel Sep 11 '18 at 15:16
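
Pulling the comment suggestions together (set `col_types`, use `data.table::rbindlist` with `idcol` instead of the cbind loop), a minimal sketch might look like this. It assumes the workbooks share a layout; the folder path and the `Amount` column are placeholders:

library(readxl)
library(data.table)

filelist <- list.files(path = "path/to/xlsx", pattern = "\\.xlsx$",
                       full.names = TRUE)

# Reading every column as text skips readxl's type guessing, which has
# to scan the sheet and is a common source of slow loads.
datalist <- lapply(filelist, read_excel, sheet = 1, col_types = "text")

# rbindlist() is faster than do.call("rbind", ...), and idcol turns the
# list names into a FileName column, replacing the for loop above.
Data <- rbindlist(setNames(datalist, basename(filelist)),
                  use.names = TRUE, idcol = "FileName")

# Convert columns back to their intended types in R afterwards, e.g.:
# Data[, Amount := as.numeric(Amount)]   # "Amount" is a hypothetical column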

1 Answer


You could consider reading the files in parallel. Here is an example:

library(parallel)

# number of parallel workers; keep at or below your CPU core count
nb_CPU <- 4
cluster <- parallel::makeCluster(nb_CPU)

setwd("filepath")
filelist <- list.files(pattern = "\\.xlsx$")

# each worker runs in a fresh R session, so it must load readxl itself
fn_Par <- function(x) {
  library(readxl)
  read_excel(x, sheet = 1)
}

datalist <- parLapply(cl = cluster, X = filelist, fun = fn_Par)
parallel::stopCluster(cluster)   # release the workers when done

# append the source file name as an extra column
for (i in seq_along(datalist)) {
  datalist[[i]] <- cbind(datalist[[i]], filelist[i])
}

Data <- do.call("rbind", datalist)
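
One caveat: the speed-up is bounded by disk I/O and by memory. With 8 GB of RAM, four workers each parsing a 60 MB+ workbook (which can expand to several times that size once loaded) may start swapping, so it is worth experimenting with fewer workers, and with setting `col_types` inside `fn_Par` as suggested in the comments.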
– Emmanuel Hamel