Fastest way to read large Excel xlsx files? To parallelize or not?

Question

My questions are:

What is the fastest way to read large(ish) .xlsx Excel files into R? 10 to 200 MB xlsx files, with multiple sheets.
Can some kind of parallel processing be used, e.g. each core reading a separate sheet of a multi-sheet Excel file?
Is there any other kind of optimisation that can be performed?

What I have understood (and what I haven't) so far:

if reading from spinning disks, as I will, parallel processing may actually slow down the reading as multiple processes try to read from the same file. However, parallel process may help with things like converting and inferring data types? Not sure how much readxl spends reading from disk (which I assume is IO bound) vs converting data types (which I guess is CPU bound).
This may be different with SSD drives. I might copy data to an SSD drive and read from there if there's a massive improvement.
data.table::fread speeds up the reading of text files (although I don't fully understand why) but it cannot be used for excel files - or can it?
I understand from this answer that readxl tends to be faster than openxlsx

I am interested in tabular data only; I am not interested in the Excel formatting, nor in charts, text labels or any other kind of data.

I am possibly looking to import into tidyverse tibbles, but not necessarily. I will then need to export the tables into a Microsoft SQL Server.

Some background: I mostly use Python and am totally new to R. Reading large Excel files in Python is painfully slow. I have already seen that R's readxl is much faster than Python's pandas (on a 15-sheet xlsx, each sheet with 10,000 rows and 32 columns: 5.6 seconds for readxl vs 33 seconds for pandas), so that's great! I would, however, still like to understand if there is any way to make the import even faster. I can read the files with R, export them to SQL, then continue the rest of my workflow with Python reading from SQL.

I don't think converting to CSV is the best option, especially not when readxl is so much faster than Python anyway; basically converting to csv may easily take longer than the time I'd save by reading from csv rather than excel. Plus, at least with Python (I don't really know enough R to have tested this thoroughly with readxl), inferring data types works much better with xlsx than with csv.

My code (any critique or suggestion is more than welcome):

library(readxl)
library(tidyverse)
library(tictoc)


this.dir <- dirname(parent.frame(2)$ofile)
setwd(this.dir)

tic("readxl")

path <- "myfile.xlsx"
sheetnames <- excel_sheets(path)
mylist <- lapply(excel_sheets(path), read_excel, path = path)

names(mylist) <- sheetnames
toc()

I thought fread only works with text files, not xlsx. Or did I get this wrong? — Pythonista anonymous, Apr 04 '19 at 10:10
I have also noticed that exporting the same 58 MBs of data to SQL is much faster in R than in Python: 4.25 vs 18.4 seconds — Pythonista anonymous, Apr 04 '19 at 11:30

clemens · Accepted Answer · 2019-04-04T12:34:13.663

You could try to run it in parallel using the parallel package, but it is a bit hard to estimate how fast it will be without sample data:

library(parallel)
library(readxl)

excel_path <- ""
sheets <- excel_sheets(excel_path)

Make a cluster with a specified number of cores:

cl <- makeCluster(detectCores() - 1)

Use parLapplyLB to go through the excel sheets and read them in parallel using load balancing:

parLapplyLB(cl, sheets, function(sheet, excel_path) {
  readxl::read_excel(excel_path, sheet = sheet)
}, excel_path)

You can use the package microbenchmark to test how fast certain options are:

library(microbenchmark)

microbenchmark(
  lapply = {lapply(sheets, function(sheet) {
    read_excel(excel_path, sheet = sheet)
  })},
  parralel = {parLapplyLB(cl, sheets, function(sheet, excel_path) {
    readxl::read_excel(excel_path, sheet = sheet)
  }, excel_path)},
  times = 10
)

In my case, the parallel version is faster:

Unit: milliseconds
     expr       min        lq     mean    median        uq      max neval
   lapply 133.44857 167.61801 179.0888 179.84616 194.35048 226.6890    10
 parralel  58.94018  64.96452 118.5969  71.42688  80.48588 316.9914    10

The test file contains of 6 sheets, each containing this table:

    test test1 test3 test4 test5
 1     1     1     1     1     1
 2     2     2     2     2     2
 3     3     3     3     3     3
 4     4     4     4     4     4
 5     5     5     5     5     5
 6     6     6     6     6     6
 7     7     7     7     7     7
 8     8     8     8     8     8
 9     9     9     9     9     9
10    10    10    10    10    10
11    11    11    11    11    11
12    12    12    12    12    12
13    13    13    13    13    13
14    14    14    14    14    14
15    15    15    15    15    15

Note: you can use stopCluster(cl) to shut down the workers when the process is finished.

what does load balancing do and why do we need it? I understand `parLapplyLB` does load balancing and `parLapply` doesn't; using the latter produces warnings about closing unused connections, but still reads the data in what seems to be the same time. — Pythonista anonymous, Apr 04 '19 at 12:14
If you have 4 cores and want to read in 8 sheets, that all take a different amount of time, you would start to read in 4 sheets, wait until all are done, and then move on to the other 4. With load balancing, the processes don't wait for each other, one core would work on the first sheet, while the others continue to the next sheets. — clemens, Apr 04 '19 at 12:32
Thanks a lot, you're a star! So in my case it probably takes the same time because all the sheets had the same size; in other circumstances, load balancing can make a difference. — Pythonista anonymous, Apr 04 '19 at 12:42

score 3 · Answer 2 · edited May 19 '21 at 05:17

I saw @clemens answer but since I have prepared some stuffs, so I am posting it anyway. In addition to @clemens answer, I use a bigger test data, and run simpler multicore option using furrr::future_map() which does not give any performance gain in the end...

Data generation

This will create 10 sheets of 10000 * 15 data with mixture of float, int, and character. On my disk, the filesize is 13.2MB.

library(writexl)
library(tidyverse)

n <- 1e4
sample_data <- map(seq(10), function(x) {
  sample_data <-
    map(1:5, function(x){
      data_frame(
        num_var = rnorm(n),
        int_var = as.integer(sample(1e5:9e5, n, replace = T)),
        char_var = sample(letters, n, replace = T)
      ) %>% rename_all(funs(paste0(., x)))
    }) %>% bind_cols()
  return(sample_data)
})
fn <- tempfile(tmpdir = "~/Desktop/temp",fileext = ".xlsx")
write_xlsx(sample_data, path = fn)

Benchmark

The parallel part is borrowed from @clemens.

library(parallel)
library(readxl)
library(purrr)

sheets <- excel_sheets(fn)

cl <- makeCluster(detectCores() - 1)
excel_path <- fn

microbenchmark::microbenchmark(
  map = map(sheets, function(x) read_xlsx(fn, sheet = x)) ,
  future_map = furrr::future_map(sheets, function(x) read_xlsx(fn, sheet = x)),
  parLapplyLB = {parLapplyLB(cl, sheets, function(sheet, excel_path) {
    readxl::read_xlsx(excel_path, sheet = sheet)
  }, excel_path)},
  times = 10
)

The benchmark result looks like this:

Unit: milliseconds
        expr       min        lq      mean    median       uq      max neval
         map 1258.2643 1272.2354 1351.8371 1291.2474 1430.211 1556.992    10
  future_map 1276.4125 1302.2022 1469.8349 1436.5356 1616.146 1702.494    10
 parLapplyLB  809.2697  863.3299  951.1041  914.9503 1014.907 1189.897    10

I have relatively weak CPU, so in other environment the gain must be big, but in the end, probably writing SQL part could be a bottle neck as reading is really fast for read_xlsx.

Note

I also tried other packages like gdata and xlsx. These are painstakingly slow, so not worth considering.

Try to include `fread` as well. I found it to be great for large files. https://www.rdocumentation.org/packages/data.table/versions/1.12.0/topics/fread — DJV, Apr 04 '19 at 11:07

Fastest way to read large Excel xlsx files? To parallelize or not?

2 Answers2

Data generation

Benchmark

Note

Linked