
It is a follow-up question to this one. What is the fastest way to read .xlsx files into R?

I use library(xlsx) to read in data from 36 .xlsx files. It works. However, the problem is that it is very time consuming (well over 30 minutes), especially considering that the data in each file is not that large (a 3*3652 matrix in each file). Is there a better way to deal with this problem? Is there another, quicker way to read .xlsx into R? Or can I quickly combine the 36 files into a single csv file and then read that into R?

Moreover, I just realised that readxl cannot write .xlsx files. Is there a counterpart of it that deals with writing instead of reading?

"Response to those voted this question down":

This question is about fact instead of the so-called "opinionated answers and spam" because speed is time and time is fact but NOT opinion.

Further update:

Perhaps someone can explain in plain language why some methods are much faster than others. I am certainly confused about this.

engineerchange
LaTeXFan
  • use `readxl::read_excel()`, it is usually faster – scoa Jun 14 '17 at 07:31
  • Either try the `openxlsx` or the `readxl` package. – Jaap Jun 14 '17 at 07:31
  • This is a perfectly reasonable question, just as we have seen other questions on SO about the fastest way to read or write `.csv` files in R. Answering it requires a benchmark, and that could be of interest to a large audience. – rafa.pereira Jun 14 '17 at 07:59

2 Answers


Here is a small benchmark test. Result: `readxl::read_xlsx` is on average about twice as fast as `openxlsx::read.xlsx` across different numbers of rows (n) and columns (p), using default settings.

[Benchmark plot: elapsed read time in seconds vs. number of rows, one panel per column count]

options(scipen=999)  # no scientific number format

nn <- c(1, 10, 100, 1000, 5000, 10000, 20000, 30000)
pp <- c(1, 5, 10, 20, 30, 40, 50)

# create some excel files
l <- list()  # save results
tmp_dir <- tempdir()

for (n in nn) {
  for (p in pp) {
    cat("\n\tn:", n, "p:", p)
    flush.console()
    m <- matrix(rnorm(n * p), n, p)
    file <- paste0(tmp_dir, "/n", n, "_p", p, ".xlsx")

    # write the matrix to an .xlsx file
    openxlsx::write.xlsx(m, file)

    # read
    elapsed <- system.time( x <- openxlsx::read.xlsx(file) )["elapsed"]
    df <- data.frame(fun = "openxlsx::read.xlsx", n = n, p = p,
                     elapsed = elapsed, stringsAsFactors = FALSE, row.names = NULL)
    l <- append(l, list(df))

    elapsed <- system.time( x <- readxl::read_xlsx(file) )["elapsed"]
    df <- data.frame(fun = "readxl::read_xlsx", n = n, p = p,
                     elapsed = elapsed, stringsAsFactors = FALSE, row.names = NULL)
    l <- append(l, list(df))

  }
}

# results 
d <- do.call(rbind, l)

library(ggplot2)

ggplot(d, aes(n, elapsed, color= fun)) + 
  geom_line() + geom_point() +  
  facet_wrap( ~ paste("columns:", p)) +
  xlab("Number of rows") +
  ylab("Seconds")
Mark Heckmann
  • Beautiful charts. Great answer! For my part, using `xlsx::write.xlsx()` on a 49 MB data frame crashed out of memory after 10 minutes, even with `options(java.parameters = "-Xmx1024m")`. The same data frame saves in under 30 seconds using `openxlsx::saveWorkbook()`. – pbnelson May 17 '18 at 21:06
  • Just to point out, `readxl::read_xlsx` is faster than `openxlsx::read.xlsx`, but it seems to use more RAM, at least for "large" datasets (373255 rows, 74 columns in my case). My computer cannot handle the data, so I have to move to `openxlsx`, which takes more time but works fine (sadly, I cannot convert the data to csv beforehand). I wonder if it is just me or if this happens to more people. – Álvaro May 27 '20 at 23:59
  • `readxl` might be faster, but with a significant cost: it only analyzes the first 1000 rows to determine the column types. This is suboptimal because, having just been bitten myself, sometimes the relevant cells are well below that number, especially for large datasets. I will stick with `openxlsx`, which has no such limitation. – Adrian Aug 30 '22 at 19:36
  • @pbnelson I also got the same issue with `write.xlsx`: a memory allocation error. So I had to use `openxlsx::saveWorkbook()`. – Golem Aug 17 '23 at 01:08
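On the type-guessing point raised in the comments: readxl does expose a `guess_max` argument (a real parameter of `read_xlsx()`/`read_excel()`, default 1000) that controls how many rows are inspected when guessing column types. A minimal sketch, with a throwaway example file created just for illustration:

```r
# Sketch: raising guess_max trades read speed for safer type detection.
# The example file below is created here only for demonstration.
path <- file.path(tempdir(), "guess_demo.xlsx")
openxlsx::write.xlsx(data.frame(v = 1:2000), path)

# Inspect up to 100000 rows when guessing column types (default is 1000)
x <- readxl::read_xlsx(path, guess_max = 100000)

# Alternatively, bypass guessing entirely by declaring the types:
y <- readxl::read_xlsx(path, col_types = "numeric")
```

Declaring `col_types` explicitly also avoids the re-guessing cost on very wide or very long files.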

To write an Excel file, readxl has a counterpart called writexl. As for the best package to read an Excel file, I think the benchmark provided above is pretty good.

The only reason I would use `xlsx` for writing would be if I had to write many Excel sheets into one .xlsx file.
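A minimal sketch of writing with writexl, assuming the package is installed (`write_xlsx()` is its real entry point; the file name and data frame here are just examples):

```r
# Sketch: write a data frame to .xlsx with writexl
library(writexl)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
path <- file.path(tempdir(), "example.xlsx")
write_xlsx(df, path)
```

Like readxl, writexl has no Java dependency, which sidesteps the `java.parameters` memory issues mentioned in the comments above.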

Shawn Brar