
I'm writing a Shiny application where a large dataset is read and some analyses are performed depending on the user's input. These analyses are based on a variable y that is computed from another variable x, depending on a value k entered by the user.

The problem here is that the conversion step is time-consuming, so right now all possible values of y are precomputed. Consider the following simplified example:

library(dplyr)

set.seed(1234)
data <- tibble(x = rnorm(n = 10, mean = 5, sd = 1))

k_vector <- seq(from = 1, to = 3, by = 1)

# precompute one y column per possible value of k
for (k in k_vector) {
  new_col <- stringr::str_c("y", k)
  data <- dplyr::mutate(data, !!new_col := x * k)
}

This results in the following table:

   x    y1    y2    y3
<dbl> <dbl> <dbl> <dbl>
1  3.79  3.79  7.59 11.4 
2  5.28  5.28 10.6  15.8 
3  6.08  6.08 12.2  18.3 
4  2.65  2.65  5.31  7.96
5  5.43  5.43 10.9  16.3 
6  5.51  5.51 11.0  16.5 
7  4.43  4.43  8.85 13.3 
8  4.45  4.45  8.91 13.4 
9  4.44  4.44  8.87 13.3 
10  4.11  4.11  8.22 12.3

I'd next save that table as .rda and read it from the Shiny application. Then I would include a command like dplyr::transmute(data, x = x, y = y1) (for k = 1 in this case) in a reactive scope, so that any time the user changes the value of k a new variable y is selected. As you can imagine, this solves the problem of converting from x to y given k.
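
For reference, a minimal sketch of what that reactive step could look like (the selectInput, the input id k and the renderTable output are assumptions for illustration, not part of the real app):

library(shiny)
library(dplyr)

load("data.rda")  # precomputed table with columns x, y1, y2, y3

ui <- fluidPage(
  selectInput("k", "k", choices = 1:3),
  tableOutput("table")
)

server <- function(input, output, session) {
  selected <- reactive({
    # pick the precomputed column that matches the chosen k
    dplyr::transmute(data, x = x, y = .data[[paste0("y", input$k)]])
  })
  output$table <- renderTable(selected())
}

shinyApp(ui, server)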

But if the real dataset and/or the number of possible values of k is large, then the stored table will be HUGE, so it becomes a problem not just of storage but also of reading time. I'm avoiding the creation of N databases, one per value of k, in the hope that there is a more efficient way to accomplish this task. Any ideas?

elcortegano
2 Answers


It will depend on your actual use case, but library(data.table) is often a good place to start when it comes to performance in R:

library(dplyr)
library(data.table)
library(microbenchmark)

set.seed(1234)
data <- tibble(x = rnorm(n = 10, mean = 5, sd = 1))
DT <- as.data.table(data) # separate copy, so `data` stays a tibble

k_vector <- seq(from = 1, to = 3, by = 1)

results <- microbenchmark("dplyr" = {
  for (k in k_vector) {
    new_col = stringr::str_c("y", k)
    data <- dplyr::mutate(data, !!new_col := x * k)
  }
}, "data.table" = {
  DT[, paste0("y", k_vector) := lapply(k_vector, "*", x)]
})

results

results:

Unit: microseconds
       expr      min       lq      mean   median       uq       max neval
      dplyr 3544.318 4039.898 5514.8110 4622.190 5995.553 13434.590   100
 data.table  355.933  415.584  667.1352  519.678  637.400  4388.128   100
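
Whether this lets you skip precomputing every y column entirely depends on how costly your real transformation is, but as a hypothetical sketch, with this kind of speed y could also be derived on demand for a single k (the convert() helper below is just for illustration):

library(data.table)

set.seed(1234)
DT <- data.table(x = rnorm(n = 10, mean = 5, sd = 1))

# derive y for one user-chosen k on demand instead of storing y1..yN
convert <- function(DT, k) DT[, .(x, y = x * k)]

convert(DT, k = 2)
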
ismirsehregal
  • Didn't know about this difference in performance with dplyr. I'll consider this option, though I fear it does not solve the root problem (not sure yet if it has a better solution). – elcortegano May 14 '19 at 12:57
  • 1
  • Sure, just wanted to point in that direction – as I mentioned, it depends on your use case. – ismirsehregal May 14 '19 at 13:07

All right, this is what I'm finally using. As mentioned in the question, the limiting step is both storing and reading the processed data sets into the Shiny app. The pre-processing step is not that limiting, since it is done only once and the data is then read from there.

So below I show how storing that information in a binary format with saveRDS gives a faster read of the stored tables (both tibble and data.table). These files will also be smaller than the ones stored by save, and obviously smaller than any plain-text format (e.g. write_csv). Based on @ismirsehregal's answer, I will also use microbenchmark for the comparison:

library(dplyr)
library(data.table)
library(microbenchmark)

set.seed(1234)
data <- tibble(x = rnorm(n = 10, mean = 5, sd = 1))
DT <- as.data.table(data) # separate copy, so `data` stays a tibble

save(data, file = "data.rda")
saveRDS(data, file = "data.rds")
save(DT, file = "DT.rda")
saveRDS(DT, file = "DT.rds")

results <- microbenchmark("read tibble" = {
  load("data.rda")
}, "read tibble (RDS)" = {
  data <- readRDS("data.rds")
}, "read DT" = {
  load("DT.rda")
}, "read DT (RDS)" = {
  DT <- readRDS("DT.rds")
})

results

# Unit: microseconds
#              expr    min      lq     mean  median      uq    max neval cld
#       read tibble 34.741 35.4355 39.18523 35.9085 36.6930 76.377   100   b
# read tibble (RDS) 20.384 21.0665 24.35454 21.4915 22.4590 78.388   100  a 
#           read DT 34.735 35.4755 38.63269 36.0855 37.3045 72.936   100   b
#     read DT (RDS) 20.416 20.9580 23.10523 21.3980 22.1375 56.261   100  a 

In both instances readRDS makes the read process faster. In addition, this lower-level function has other advantages over load (e.g. it won't overwrite any existing name in the workspace), which are summarized in this blog.
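
If you want to check the size claim on your own data, the files written by the code above can be compared directly:

# compare the sizes (in bytes) of the files written above
files <- c("data.rda", "data.rds", "DT.rda", "DT.rds")
setNames(file.size(files), files)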

elcortegano