17

I have six SQL queries that I script though R that each take a very long time (~30 minutes each). Once each query returns I then manipulate the data for some standard reports.

What I'd like to do is use my multicore machine to run these SQL requests in parallel from R.

I'm on a Windows machine with a Oracle DB. I was following a blog post to use doSNOW and foreach to try and split these requests and this is the best thing I can find on stackoverflow.

I've been able to get the process to work for the non-parallel %do% version of foreach but not the %dopar%. With %dopar% it just returns an empty set. Below is code that sets up tables and runs the queries so you can see what happens. Apologies in advance if there's too much basic code.

I've looked at some of the other R packages but didn't see an obvious solution. Also if you have a better way to manage this kind of process I'd be interested to hear it - just keep in mind I'm an analyst not a computer scientist. Thanks!

#Creating a cluster
library(doSNOW)
cl <- makeCluster(c("localhost","localhost"), type = "SOCK")
registerDoSNOW(cl)

#Connecting to database through RODBC
ch=odbcConnect("",pwd = "xxxxx", believeNRows=FALSE)
#Test connection
odbcGetInfo(ch)

#Creating database tables for example purposes
qryA1 <- "create table temptable(test int)" 
qryA2 <- "insert into temptable(test) values((1))" 
qryA3 <- "select * from temptable" 
qryA4 <- "drop table temptable" 
qryB1 <- "create table temptable2(test int)" 
qryB2 <- "insert into temptable2(test) values((2))" 
qryB3 <- "select * from temptable2" 
qryB4 <- "drop table temptable2"  

sqlQuery(ch, qryA1) 
sqlQuery(ch, qryA2) 
doesItWork <- sqlQuery(ch, qryA3) 
doesItWork
sqlQuery(ch, qryB1) 
sqlQuery(ch, qryB2) 
doesItWork <- sqlQuery(ch, qryB3) 
doesItWork

result = c()
output = c()
databases <- list('temptable','temptable2')


#Non-parallel version of foreach
system.time(
foreach(i = 1:2)%do%{
result<-sqlQuery(ch,paste('SELECT * FROM ',databases[i]))   
output[i] = result
}
) 

output

#Parallel version of foreach

outputPar = c()

system.time(
foreach(i = 1:2)%dopar%{
#Connecting to database through RODBC
ch=odbcConnect(dsn ,pwd = "xxxxxx", believeNRows=FALSE)
#Test connection
odbcGetInfo(ch)
result<-sqlQuery(ch,paste('SELECT * FROM ',databases[i]))   
outputPar[i] = result
}
) 

outputPar

sqlQuery(ch, qryA4)
sqlQuery(ch, qryB4) 
Community
  • 1
  • 1
Andrew Elliott
  • 235
  • 1
  • 3
  • 9
  • 2
    Querying the database in parallel may only make the process slower, depending on what is causing the bottleneck. Are you certain the bottleneck is the R instance running on your machine, (not the database, the network connection, the machine the database is running on, etc.)? – Joshua Ulrich Mar 29 '12 at 19:16
  • Is the `RODBC` package loaded on the cluster nodes? If not, add `.packages="RODBC"` to your `foreach()` call. – BenBarnes Mar 29 '12 at 19:32
  • @Joshua Ulrich - The database is pretty robust since it's a commercial Oracle installation. The queries just join across several tables requiring a lot of processing in the database. The manual way that I would speed this up is to initiate several instances from SQL Developer but an R script does a lot of the post query processing for me so I'd rather have R initiate multiple queries at once so that the longest time that I have to wait is the long query time instead of adding all the query times together. – Andrew Elliott Mar 29 '12 at 19:57
  • @BenBarnes I tried adding `.packages="RODBC"` into the `foreach` but unfortunately I still received a Null set. – Andrew Elliott Mar 29 '12 at 19:57
  • 1
    RODBC is incredible slow in moving data, so if it's returning a lot of data you will get better results by dumping the query results to csv and move over SMB and then import. I got a factor of 1:20 speedup working with a similar issue. – Hansi Mar 30 '12 at 00:39

1 Answers1

14

When you make the assignment outputPar[i] = result inside the serial foreach loop, this is OK (but not really the intended use of foreach). When you make this assignment in the parallel loop, it is not OK. See http://tolstoy.newcastle.edu.au/R/e10/help/10/04/3237.html for a similar question answered by David Smith at Revolution.

As a solution,

system.time(
  outputPar <- foreach(i = 1:2, .packages="RODBC")%dopar%{
#Connecting to database through RODBC
  ch=odbcConnect(dsn ,pwd = "xxxxxx", believeNRows=FALSE)
#Test connection
  odbcGetInfo(ch)
  result<-sqlQuery(ch,paste('SELECT * FROM ',databases[i]))   
  result
}
)
BenBarnes
  • 19,114
  • 6
  • 56
  • 74
  • 1
    Very nice - I modified the %dopar% code like this and it worked. Thanks for all the help guys. `system.time( test <- foreach(i = 1:2, .packages="RODBC")%dopar%{ #Connecting to database through RODBC ch=odbcConnect(dsn ,pwd = "xxxxxxx", believeNRows=FALSE) #Test connection odbcGetInfo(ch) sqlQuery(ch,paste('SELECT * FROM ',databases[i])) } ) test` – Andrew Elliott Mar 30 '12 at 16:17
  • 2
    What was the difference in speed? – Frank Apr 26 '20 at 02:41