I've got a very large dataset (~150 GB) stored in an SQLite database, and I need to query certain rows based on an identifier (id) held in another data frame (df). Using the standard approach below, this takes about 4 minutes.

library(DBI)
library(dplyr)
library(dbplyr)

# connect to the SQLite database and create a lazy reference to the table
dbconn <- DBI::dbConnect(RSQLite::SQLite(), "My_database.db")
my_table <- tbl(dbconn, "My_table")

# keep only rows whose id appears in the local df, then pull them into R
database_selection <- my_table %>%
    filter(id %in% local(df$id)) %>%
    collect()

I'm trying to exploit my 48-core processor to do this faster, but I'm running into issues: the process doesn't complete even after about 30 minutes, at which point I end up terminating it. I'm unclear whether there are overheads I'm not understanding that make setting this up in parallel very costly, or whether there's simply an issue with my code.

My approach is based on the solution to this question: Parallel processing and querying SQL with dplyr or pool: MySQL server has gone away, but using sockets & parLapply since I'm on a Windows machine.

Specifically:

my_function <- function(ids) {
  # create a connection and close it on exit
  dbconn <- DBI::dbConnect(RSQLite::SQLite(), "My_database.db")
  on.exit(DBI::dbDisconnect(dbconn))
  
  # lazy reference to the table
  my_table <- tbl(dbconn, "My_table")
  
  # filter rows matching the supplied ids and pull them into R
  my_selection <- my_table %>% 
    filter(id %in% local(ids)) %>%
    collect()  

  return(my_selection)
}

library(parallel)

system.time({
  numCores <- detectCores() - 1
  cl <- makeCluster(numCores, type = "PSOCK")
  clusterEvalQ(cl, {
    library(RSQLite)
    library(dplyr)
    library(dbplyr)
  })
  database_selection <- parLapply(cl = cl, df$id, my_function)
  stopCluster(cl)
})
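
For comparison, a batched variant of the same setup — splitting df$id into one chunk per worker so each worker opens a single connection and runs a single query, rather than one parLapply() call per id — might look roughly like the sketch below. This is untested and reuses my_function() from above; the split()/cut() chunking is just one possible way to form the batches.

library(parallel)
library(dplyr)

numCores <- detectCores() - 1
cl <- makeCluster(numCores, type = "PSOCK")
clusterEvalQ(cl, {
  library(RSQLite)
  library(dplyr)
  library(dbplyr)
})

# split df$id into roughly equal-sized chunks, one per worker
id_chunks <- split(df$id, cut(seq_along(df$id), numCores, labels = FALSE))

# each worker opens one connection and runs one query per chunk
results <- parLapply(cl, id_chunks, my_function)
stopCluster(cl)

# my_function() returns a data frame per chunk; stack them back together
database_selection <- bind_rows(results)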


Any suggestions would be much appreciated.

Nik-D
  • It might be possible to cut the time down without dealing with parallel processing at all, but I don't know if you're asking this as an "I want to be able to do this in the future, so I'm figuring it out on this one thing" kind of question. – Abigail Feb 26 '23 at 22:21
  • Yeah I’ll be doing a lot of these queries so was trying to figure it out for future use. Open to any suggestions though! – Nik-D Feb 27 '23 at 04:21
  • 1
    I do a lot of back and forth with local to sql, pretty much the exact thing you're describing here. Whenever anything runs slow I do `show_query()` since sometimes it is just writing bad sql. I've also found that selecting only the id column helps since it copies it to the sql database (so less to copy). I usually do a semi_join for this, I'm curious as to how the filter v. semi_join get translated on the backend. And in at least one case, doing a manual `copy_to()` beforehand fixed how long it was taking. – Abigail Feb 27 '23 at 23:54
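
To make that comment concrete, the copy_to() + semi_join() pattern described above might look roughly like the following untested sketch; the temporary table name "tmp_ids" is just a placeholder, and the id column is assumed from the question.

library(DBI)
library(dplyr)
library(dbplyr)

dbconn <- DBI::dbConnect(RSQLite::SQLite(), "My_database.db")

# copy only the (deduplicated) id column into a temporary table on the SQLite side
ids_tbl <- copy_to(dbconn, distinct(df, id), name = "tmp_ids", temporary = TRUE)

# keep rows of My_table whose id appears in tmp_ids; the join runs inside SQLite
database_selection <- tbl(dbconn, "My_table") %>%
  semi_join(ids_tbl, by = "id") %>%
  collect()

DBI::dbDisconnect(dbconn)

The idea is that the semi_join should translate to a subquery against the temporary table rather than one very long IN (...) list embedded in the SQL text, which you can check with show_query() before collecting.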

0 Answers