
I have a large database (~100 GB) from which I need to pull every entry, perform some comparisons on it, and then store the results of those comparisons. I have attempted to run parallel queries within a single R session without any success. I could just run multiple R sessions at once, but I am looking for a better approach. Here is what I attempted:

library(RSQLite)
library(data.table)
library(foreach)
library(doMC)



#---------
# SETUP
#---------


#connect to db
db <- dbConnect(SQLite(), dbname="genes_drug_combos.sqlite")


#---------
# QUERY
#---------
# 856086 combos = 1309 * 109 * 6

registerDoMC(8)

#I would run 6 separate R sessions (one for each i)
res_list <- foreach(i=1:6) %dopar% {

  a <- i*109-108
  b <- i*109

  pb  <- txtProgressBar(min=a, max=b, style=3)
  res <- list()

  for (j in a:b) {

    #get preds for drug combos
    statement   <- paste("SELECT * from combo_tstats WHERE rowid BETWEEN", (j*1309)-1308, "AND", j*1309)
    combo_preds <- dbGetQuery(db, statement)

    #here I do some stuff to the result returned from the query
    combo_names <- combo_preds$drug_combo
    combo_preds <- as.data.frame(t(combo_preds[,-1]))

    colnames(combo_preds)  <- combo_names

    #get top drug combos
    top_combos <- get_top_drugs(query_genes, drug_info=combo_preds, es=T)

    #update progress and store result
    setTxtProgressBar(pb, j)
    res[[ length(res)+1 ]] <- top_combos
  }
  #bind results together
  res <- rbindlist(res)
}
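(For what it's worth, the chunk arithmetic above can be sanity-checked in plain R, without touching the database; a quick sketch:)

```r
# Pure-R sanity check (no database needed): the i/j arithmetic above should
# partition rowids 1..856086 (= 1309 * 109 * 6) with no gaps and no overlaps.
rowids <- unlist(lapply(1:6, function(i) {
  unlist(lapply((i * 109 - 108):(i * 109), function(j) {
    ((j * 1309) - 1308):(j * 1309)
  }))
}))
stopifnot(length(rowids) == 856086)           # every rowid hit exactly once...
stopifnot(identical(sort(rowids), 1:856086))  # ...with no gaps or overlaps
```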

I don't get any errors, but only one core spins up. In contrast, if I run multiple R sessions, all my cores go at it. What am I doing wrong?

alexvpickering

1 Answer


Some things I have learned while concurrently accessing the same SQLite database file with RSQLite:

1. Make sure each worker has its own DB connection.

  parallel::clusterEvalQ(cl = cl, {
    db.conn <- RSQLite::dbConnect(RSQLite::SQLite(), "./export/models.sqlite")
    RSQLite::dbClearResult(RSQLite::dbSendQuery(db.conn, "PRAGMA busy_timeout=5000;"))
  })
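A fuller sketch of this pattern end to end, using the question's database file and table names (so treat the path and query as placeholders to adapt):

```r
# Hypothetical end-to-end sketch: each worker opens (and later closes) its
# OWN connection. Requires the database file from the question to exist.
library(parallel)

cl <- makeCluster(6)

# one connection per worker, with busy_timeout set on each
clusterEvalQ(cl, {
  db.conn <- RSQLite::dbConnect(RSQLite::SQLite(), "genes_drug_combos.sqlite")
  RSQLite::dbClearResult(RSQLite::dbSendQuery(db.conn, "PRAGMA busy_timeout=5000;"))
  NULL
})

# each worker queries through its own db.conn (found in the worker's
# global environment, where clusterEvalQ created it)
res_list <- parLapply(cl, 1:6, function(i) {
  statement <- paste("SELECT * FROM combo_tstats WHERE rowid BETWEEN",
                     (i * 1309) - 1308, "AND", i * 1309)
  RSQLite::dbGetQuery(db.conn, statement)
})

# clean up
clusterEvalQ(cl, RSQLite::dbDisconnect(db.conn))
stopCluster(cl)
```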

2. Use PRAGMA busy_timeout=5000;

By default this is set to 0, and chances are you will hit a "database is locked" error whenever a worker tries to write to the DB while it is locked. The code in point 1 sets this PRAGMA on each worker connection. Note that SELECT operations are never blocked this way; only INSERT/DELETE/UPDATE are.
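If a lock still slips through despite the timeout, a small retry wrapper can serve as a fallback. This is a hypothetical helper (the function name `with_lock_retry` is mine, not part of RSQLite):

```r
# Hypothetical fallback helper: retry an operation a few times when a
# "database is locked" error slips through despite busy_timeout.
with_lock_retry <- function(op, tries = 5, wait = 0.5) {
  for (k in seq_len(tries)) {
    res <- tryCatch(op(), error = function(e) e)
    if (!inherits(res, "error")) return(res)
    if (!grepl("database is locked", conditionMessage(res))) stop(res)
    Sys.sleep(wait)   # back off before the next attempt
  }
  stop(res)           # still locked after all attempts
}

# usage sketch:
# with_lock_retry(function() DBI::dbExecute(db.conn, "INSERT INTO ..."))
```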

3. Use PRAGMA journal_mode=WAL;

This only has to be set once and then persists in the database file. It adds two (more or less permanent) companion files next to the DB and improves concurrent read/write performance.
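Setting it is a one-off sketch like the following (using the question's database file; since the mode persists, any later connection inherits it):

```r
# One-time setup sketch: switch the database file to WAL journaling.
library(RSQLite)

con <- dbConnect(SQLite(), "genes_drug_combos.sqlite")
dbGetQuery(con, "PRAGMA journal_mode=WAL;")   # should report "wal"
dbDisconnect(con)
```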

With the above settings I have not experienced this issue.

Davor Josipovic
  • I still get `database is locked` errors. Oddly, they sometimes happen when the process has been running for less than the `busy_timeout` (I've tried values between 5s and 2min). Any ideas? I guess switching to PostgreSQL would get rid of these problems. – Galadude Jun 21 '17 at 13:17
  • No need to switch. SQLite is surprisingly fast and easy to use. Check your commits. I have 8 processes running 100% constantly accessing the DB on an NTFS partition. If, for example, I start building an index from within the console, I see all 8 processes halt, waiting for the DB to become available again, which happens immediately after the index is built. What causes immediate dropouts (i.e. locks) is, for example, running `VACUUM` from within the console. – Davor Josipovic Jun 21 '17 at 14:04
  • I'm using `dbWriteTable`, maybe that's giving trouble. I've been looking for a way to find a log of the database's commits, and can't find anything. Any advice? – Galadude Jun 21 '17 at 14:24
  • Maybe yes. I always use plain SQL: `dbSendQuery` and `dbGetQuery`. – Davor Josipovic Jun 21 '17 at 14:45
  • @Galadude when you switched to `dbSendQuery`, did it fix your problem? I'm also getting `database is locked` errors when using `dbWriteTable`. – rrs Jun 27 '17 at 21:00
  • dbSendQuery is for SELECT statements, and I need to insert rows. – Galadude Jun 28 '17 at 06:18
  • No, it is not only for SELECT statements. You can use UPDATE/INSERT/DELETE statements too and bind variables through the `param` argument. – Davor Josipovic Jun 28 '17 at 12:40