
What's the best way to improve the efficiency of loading several tables from a MySQL (Maria) DB into R Shiny using the pool package?

At the moment I'm using a for loop. Should I initialise the tables or is there a better way (an apply function)?

library(shiny)
library(DBI)
library(pool)
library(dplyr)

pool <- dbPool(
  drv      = RMariaDB::MariaDB(),
  dbname   = "db",
  host     = "localhost",
  username = "user", 
  password = "",
  port     = 3306,
  bigint   = "numeric"
)
tabFull <- dbListTables(pool)
tablesExport <- tabFull[grepl("export", tabFull)]

  for(i in 1:length(tablesExport)){
     tmp <- data.frame(tbl(pool, tablesExport[i]))
     tmp$exclDate <- tmp$uploaded_at + (365*24*60*60)  
     tmp <- filter(tmp, !(exclDate > Sys.time() & !(private %in% c(0, NA))) & excluded %in% c(0, NA)) 
     assign(tablesExport[i], select(tmp, -exclDate, -excluded, -export_time, -private))

     #assign(tablesExport[i], `[[<-`(get(tablesExport[i]), 'exclDate', value = as.POSIXct(eval(parse(text = tablesExport[i]))$uploaded_at) + (365*24*60*60)))
     #assign(tablesExport[i], filter(eval(parse(text = tablesExport[i])), !(exclDate > Sys.time() & !(private %in% c(0, NA))) & excluded %in% c(0, NA)))
     #assign(tablesExport[i], select(eval(parse(text = tablesExport[i])), -exclDate, -excluded, -export_time, -private))
  }

A reprex is difficult when there's a DB involved, so I hope this pseudo-code will do. All the code is up on GitHub too, so if you need more context see here for the Shiny app and here for the SQL.

Simon
  • Wow, !@! BINGO !@!. I don't know that I've seen code -- in one line, even -- that combines `assign`, `get`, `eval`, and `parse` (all *high-risk* functions), and `[[<-` (which is fine, but primarily good for code-golf and/or obfuscated code). The first four functions I listed are all ones that I generally *strongly discourage* using, and you've combined them into one line. I'm guessing you have reason for it, but ... *dang* ... – r2evans Mar 04 '21 at 22:07
  • Perhaps less poignantly ... I don't think minimizing R code is going to yield significant performance improvements, so if your use of everything-on-one-line is for run-time, then I suggest that *readability* goes hand-in-hand with *maintainability*, which means that in the long term it'll be easier to update or troubleshoot. Also, you `assign` to the same `tablesExport[i]` four times each pass; it would be far better (and more readable :-) to work with a temporary variable in the `for` loop, and if you really need `assign`, do it *once* at the end of each loop iteration. – r2evans Mar 04 '21 at 22:11
  • To answer your questions: the only `apply` function I'd recommend is `lapply`, in line with https://stackoverflow.com/a/24376207/3358227 (storing a `list`-of-`data.frame`s, a strong alternative to using `get`/`assign`). – r2evans Mar 04 '21 at 22:13
  • Thanks for the comments @r2evans. Not the best bingo win, but it worked for a start. Now time to make some improvements. All ears – Simon Mar 04 '21 at 22:13
  • @r2evans I'm playing along at home and started to make edits based on your comments – Simon Mar 04 '21 at 22:31
  • See my answers ... lots of stylistic projection, my apologies, take it all with a grain of salt. It is true, though, that general best practices in robust/resilient R code discourage the use of all four of those functions, sparing very specific exceptional situations. Even then, they are still risky. – r2evans Mar 04 '21 at 22:33

1 Answer


I suggest you write one function that does everything you have in your for loop, then iterate over the list of tables more simply (or at least more readably).

I think I understand your motivation for working in side-effect, assigning each table to its own variable, so I'll stick with that (but offer an alternative anyway). I'll also stick with your use of dplyr verbs and just go all-in with pipes (for readability).

library(dplyr)
func <- function(tblnm, con, envir = NULL) {
  now <- Sys.time()
  yearseconds <- (365*24*60*60)   # seconds in one (non-leap) year
  # pull the table, add the exclusion date, filter, then drop the helper columns
  dat <- data.frame(tbl(con, tblnm)) %>%
    mutate(exclDate = uploaded_at + yearseconds) %>%
    filter(!(exclDate > now & !(private %in% c(0, NA))) & excluded %in% c(0, NA)) %>%
    select(-exclDate, -excluded, -export_time, -private)
  # optional side-effect: assign the frame into the supplied environment
  if (!is.null(envir)) assign(tblnm, dat, envir = envir)
  invisible(dat)
}

With this, you can replace much of above with

for (i in seq_along(tablesExport)) {
  func(tablesExport[i], con = pool, envir = .GlobalEnv)
}

Note that I was explicit about passing both the connection object (your pool) and the environment in which you wanted the side-effect. I did this to be more declarative, and to support other possible uses (within non-global environments, who knows).
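For instance, a minimal sketch of that non-global use, assuming the func above; the environment name tableEnv is purely illustrative:

tableEnv <- new.env(parent = emptyenv())    # a dedicated home for the tables
for (nm in tablesExport) {
  func(nm, con = pool, envir = tableEnv)    # side-effect lands in tableEnv, not .GlobalEnv
}
ls(tableEnv)                                # names of the loaded tables
get(tablesExport[1], envir = tableEnv)      # retrieve one of them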

One thing this supports is operating without side-effect, working with the list of frames that I mentioned in my comment. Instead of the for loop and assigning each table to its own variable, you can then do:

dat <- lapply(setNames(nm=tablesExport), func, con = pool, envir = NULL)

The envir=NULL is not strictly needed because that is the default in the function definition, but I did it here to be perfectly clear: I don't want the side-effect. Using this dat, you would then use dat[[ tablesExport[3] ]] or (if one table is named UserList) dat$UserList to access individual tables.
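Purely as an illustration of working with that list (uploaded_at is the column from your tables, but the 30-day window here is made up):

nrow(dat[[ tablesExport[1] ]])    # one table, looked up by name
sapply(dat, nrow)                 # row counts across all exported tables
recent <- lapply(dat, function(x) filter(x, uploaded_at >= Sys.time() - 30*24*60*60))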

A couple of notes about some of my style choices:

  • I try not to use 1:length(...); while it works most of the time, if length(.) is ever 0, one would hope that the for loop does nothing, but unfortunately it sees 1:0, which resolves to c(1, 0), not an empty vector. seq_along(x) is equivalent to seq_len(length(x)), which is equivalent to 1:length(x) except when x is empty.
  • I used lapply(setNames(nm=tablesExport), ...) since the default action of lapply is to name its output elements only if the object is named going in. While sapply does name things intuitively (it substitutes/deparses the vector/list), it can also return a vector or a list, which can be frustrating for pure programmatic use. I prefer lapply for that reason. Alternatively, I could have used sapply(tablesExport, func, con=pool, envir=NULL, simplify=FALSE) for the same effect.
  • I defined now and yearseconds mostly because I try to resist hard-coded constants in code without clear definition. Naming them something meaningful is useful and declarative. (I've had to debug modules that used for instance 10 in various places throughout the code ... some of them meant the same constant, some meant something different, so a global-search-replace to update one of the constants failed.)
  • I used invisible for the option of using it with side-effect; without invisible, func(somename, ...) would dump a frame to the console, which may not be what you want. invisible allows you to return a value, but if the user does not explicitly capture it into an object, it won't pummel the console with a frame rendering. When used as somevar <- func(...), it does absolutely nothing.
  • One last thing ... a more defensive version of func would check for the existence of those column names before arbitrarily filtering on them. One would then either (a) fail immediately, or (b) only filter on a column if the column is present (a rough sketch of (b) follows this list). While (b) could still produce "interesting" effects in down-stream code, it would still allow the down-stream code to execute.
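On that last point, here is a rough sketch of option (b), assuming the same column names as in your original code; func_safe and the exact checks are only illustrative, not a drop-in replacement (and any_of() needs dplyr >= 1.0.0):

func_safe <- function(tblnm, con, envir = NULL) {
  now <- Sys.time()
  yearseconds <- (365*24*60*60)
  dat <- data.frame(tbl(con, tblnm))
  # only build the exclusion filter if all of its columns are present
  if (all(c("uploaded_at", "private", "excluded") %in% names(dat))) {
    dat <- dat %>%
      mutate(exclDate = uploaded_at + yearseconds) %>%
      filter(!(exclDate > now & !(private %in% c(0, NA))) & excluded %in% c(0, NA))
  }
  # drop helper/meta columns only where they exist; any_of() tolerates absences
  dat <- select(dat, -any_of(c("exclDate", "excluded", "export_time", "private")))
  if (!is.null(envir)) assign(tblnm, dat, envir = envir)
  invisible(dat)
}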
r2evans
  • Great answer. I appreciate the style explanations too. seq_len(length(x)) is filling all my for loops now. P.S. Do you have any links to the perils of assign, eval, parse and get? – Simon Mar 04 '21 at 22:39
  • On your P.S., not authoritative links, mostly just experience here on StackOverflow and in the wild in my day job with other R programmers. If you read enough questions on SO that ask about using `eval`/`parse`, you'll find lots of warnings, including inadvertent code-injection (similar to SQL injection, even when not malicious). And usually, the desire to use `assign`/`get` is predicated on an architecture that has not considered long-term effects and things like ease of troubleshooting. Sorry, I should find a canonical reference to why those should be used sparingly ... – r2evans Mar 04 '21 at 22:41
  • No problem. Just curious. Also curious about invisible(). Googling now ... – Simon Mar 04 '21 at 22:43
  • See my comment about `invisible`. Perhaps overkill for your use; I generalized the function a little based on multiple use-cases. – r2evans Mar 04 '21 at 22:45
  • Doing some rough benchmarking (system.time) I get similar times. Func + lapply - 67.89, 67.29; Func + for-loop - 68.16, 66.59; Old - 67.297, 67.23. I think I need to have a closer look at the pool package – Simon Mar 05 '21 at 10:58
  • 1
    If you're only doing a one-time query of all data from those tables, the `pool` package gets you very little; you may be able to go straight to the `odbc` package instead. The biggest benefit (to me) of the `pool` package is its auto-reconnect functionality, where when a connection dies (timeout, network glitch, who knows), it will attempt to reconnect the next time you try to use it; normal `DBI` connections (including `odbc`, `RPostgres`, etc) will just fail and make *you* do the reconnection. (`pool` should not be slowing you down, either.) – r2evans Mar 05 '21 at 13:33