
I have the below-mentioned dataframe in R.

ID       Amount     Date
IK-1     100        2020-01-01
IK-2     110        2020-01-02
IK-3     120        2020-01-03
IK-4     109        2020-01-03
IK-5     104        2020-01-03

I'm using ID to fetch some details from MySQL using the following code.

library(RMySQL)

conn<- connection

query<-paste0("SELECT c.ID,e.Parameters, d.status
FROM Table1 c
left outer join Table2 d ON d.seq_id=c.ID
LEFT outer JOIN Table3 e ON e.role_id=d.role
           where c.ID IN (", paste(shQuote(dataframe$ID, type = "sh"),
                                      collapse = ', '),") 
and e.Parameters in
           ('Section1',
           'Section2','Section3',
           'Section4');")

res1 <- dbGetQuery(conn,query)

res2<-res1[res1$Parameters=="Section1",4:5]
colnames(res2)[colnames(res2)=="status"] <- "Section1_Status"

The above code works fine if I pass ~1,000 IDs, but it throws an R termination error when passing 10,000 or more IDs at a time.

How can I create a loop and pass the IDs in batches to get one final output for all 10,000 IDs?

Error Message:

Warning message:
In dbFetch(rs, n = n, ...) : error while fetching rows
  • Just guessing, can we not join the r dataframe, too? Replace your "where ... " with `JOIN dataframe[, "ID", drop = FALSE] x ON x.ID = e.role_id` ? – zx8754 Feb 03 '20 at 08:43
  • @zx8754: I tried, It didn't work. – Vector JX Feb 03 '20 at 08:48
  • @Vector JX could you please include the error msg. – A. Suliman Feb 03 '20 at 08:48
  • @A.Suliman: After running for around 10min the R got terminated as session expire, without showing any error message. The code is working fine when passing around 1000 ID. – Vector JX Feb 03 '20 at 08:51
  • @A.Suliman: sometime getting error message like `Warning message: In dbFetch(rs, n = n, ...) : error while fetching rows` – Vector JX Feb 03 '20 at 08:52
  • @VectorJX I hope this [MySQL IN condition limit](https://stackoverflow.com/questions/4275640/mysql-in-condition-limit/4275704) can help. also see this one https://stackoverflow.com/questions/9874688/rmysql-fetch-errors-rs-dbi-driver-warning-error-while-fetching-rows – A. Suliman Feb 03 '20 at 09:03

4 Answers


Write the data frame of IDs into a temporary table before your SQL query, then inner join on it to restrict to the IDs you need; this way you can avoid looping entirely. All you have to do is call dbWriteTable and set the parameter temporary = TRUE.

EX:

library(DBI)
library(RMySQL)
con <- dbConnect(RMySQL::MySQL(), user='user', 
password='password', dbname='database_name', host='host')
# here we write the data frame into the DB and declare it as temporary
dbWriteTable(conn = con, value = dataframe, name = "id_frame", temporary = TRUE)
res1 <- dbGetQuery(conn = con, "SELECT c.ID, e.Parameters, d.status
FROM Table1 c
left outer join Table2 d ON d.seq_id=c.ID
LEFT outer JOIN Table3 e ON e.role_id=d.role
Inner join id_frame idf on idf.ID = c.ID 
and e.Parameters in
       ('Section1',
       'Section2','Section3',
       'Section4');")

This should also improve your code's performance, since you no longer need to loop in R or build a huge WHERE clause. Let me know if it isn't working properly.

  • I'm getting error `could not run statement: CREATE command denied to user` – Vector JX Feb 07 '20 at 17:56
  • Is there any way for your System admin to grant you create access to temporary tables? You probably have read-only access to the DB. There is a difference in permission to CREATE and CREATE TEMPORARY TABLE. I just want to verify you have temp table permissions. – masterbingo1 Feb 07 '20 at 19:01
  • No, I don't have write access. – Vector JX Feb 07 '20 at 19:09
# Load Packages
library(dplyr) # only needed to create the initial dataframe
library(RMySQL)

# create the initial dataframe
df <- tribble(
    ~ID,       ~Amount,     ~Date
    , "IK-1"    , 100       , "2020-01-01"
    , "IK-2"    , 110       , "2020-01-02"
    , "IK-3"    , 120       , "2020-01-03"
    , "IK-4"    , 109       , "2020-01-03"
    , "IK-5"    , 104       , "2020-01-03"
)

# first helper function
createIDBatchVector <- function(x, batchSize){
    paste0(
        "'"
        , sapply(
            split(x, ceiling(seq_along(x) / batchSize))
            , paste
            , collapse = "','"
        )
        , "'"
    )
}

# second helper function
createQueries <- function(IDbatches){
    paste0("
SELECT c.ID,e.Parameters, d.status
FROM Table1 c
    LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
    LEFT OUTER JOIN Table3 e ON e.role_id = d.role
WHERE c.ID IN (", IDbatches,") 
AND e.Parameters in ('Section1','Section2','Section3','Section4');
")
}

# ------------------------------------------------------------------

# and now the actual script

# first we create a vector that contains one batch per element
IDbatches <- createIDBatchVector(df$ID, 2)

# It looks like this:
# [1] "'IK-1','IK-2'" "'IK-3','IK-4'" "'IK-5'" 

# now we create a vector of SQL-queries out of that
queries <- createQueries(IDbatches)

cat(queries) # use cat to show what they look like

# it looks like this:

# SELECT c.ID,e.Parameters, d.status
# FROM Table1 c
#     LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
#     LEFT OUTER JOIN Table3 e ON e.role_id = d.role
# WHERE c.ID IN ('IK-1','IK-2') 
# AND e.Parameters in ('Section1','Section2','Section3','Section4');
#  
# SELECT c.ID,e.Parameters, d.status
# FROM Table1 c
#     LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
#     LEFT OUTER JOIN Table3 e ON e.role_id = d.role
# WHERE c.ID IN ('IK-3','IK-4') 
# AND e.Parameters in ('Section1','Section2','Section3','Section4');
#  
# SELECT c.ID,e.Parameters, d.status
# FROM Table1 c
#     LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
#     LEFT OUTER JOIN Table3 e ON e.role_id = d.role
# WHERE c.ID IN ('IK-5') 
# AND e.Parameters in ('Section1','Section2','Section3','Section4');

# and now the loop
df_final <- data.frame() # initialize a dataframe

conn <- connection # open a connection
for (query in queries){ # iterate over the queries
    df_final <- rbind(df_final, dbGetQuery(conn,query))
}

dbDisconnect(conn) # and here the connection is closed again
Georgery

As the links from @A.Suliman suggest, this is most likely due to having a large number of values in your IN-clause. Here are some solutions to try:

Batch processing

I am a fan of using modulo to batch process. This assumes the ID values you are batching across are numeric:

library(dplyr) # for filter()

num_batches <- 100
output_list <- list()

for(i in 1:num_batches){
    this_subset <- filter(dataframe, ID %% num_batches == (i - 1))

    # subsequent processing using this_subset

    output_list[[i]] <- results_from_subsetting
}
output <- data.table::rbindlist(output_list)

In your case it looks like the ID takes the form XX-123 (two characters, a hyphen, followed by some numbers). You can convert this into a number using: `just_number_part = as.numeric(substr(ID, 4, nchar(ID)))`.
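A minimal sketch of that conversion and the modulo batching on IDs shaped like the example data (pure R, no database needed; `ids` and `num_batches` are illustrative values):

```r
# illustrative IDs in the "IK-123" shape described above
ids <- c("IK-1", "IK-2", "IK-10", "IK-123")

# strip the "IK-" prefix (positions 1-3) and convert the rest to a number
just_number_part <- as.numeric(substr(ids, 4, nchar(ids)))

# assign each ID to a batch via modulo on the numeric part
num_batches <- 2
batch_index <- just_number_part %% num_batches
split(ids, batch_index)
# $`0` -> "IK-2"  "IK-10"     $`1` -> "IK-1"  "IK-123"
```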

Temporary file writing

If you were to write dataframe from R to sql then you would not need such a large IN-clause and could use a join instead. The dbplyr package includes a function copy_to that can be used to write temporary tables to the database.

This would look something like:

library(RMySQL)
library(dbplyr)

conn<- connection

copy_to(conn, dataframe, name = "my_table_name") # copy local table to mysql

query<-paste0("SELECT c.ID,e.Parameters, d.status
FROM Table1 c
INNER JOIN my_table_name a ON a.ID = c.ID # replace IN-clause with inner join
left outer join Table2 d ON d.seq_id=c.ID
LEFT outer JOIN Table3 e ON e.role_id=d.role
WHERE e.Parameters in
           ('Section1',
           'Section2','Section3',
           'Section4');")

res1 <- dbGetQuery(conn,query)

For reference I recommend the tidyverse documentation. You might also find this question on writing using copy_to helpful for debugging.

Increasing the timeout delay

When there are a lot of values in an IN-clause the query becomes much slower to execute as an IN-clause is essentially translated to a sequence of OR statements.

According to this link you can change the timeout options for MySQL by:

  • Edit your my.cnf (MySQL config file)
  • Add the timeout configuration and adjust it to fit your server.
    • wait_timeout = 28800
    • interactive_timeout = 28800
  • Restart MySQL
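If editing my.cnf is not possible (e.g. no server access), a session-level sketch, assuming `conn` is an open RMySQL/DBI connection, would be:

```r
library(DBI)

# check the current timeout settings for this session
dbGetQuery(conn, "SHOW VARIABLES LIKE 'wait_timeout';")

# raise the timeouts for the current session only (no server restart needed)
dbExecute(conn, "SET SESSION wait_timeout = 28800;")
dbExecute(conn, "SET SESSION interactive_timeout = 28800;")
```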
Simon.S.A.
  • Getting error `could not run statement: CREATE command denied to user`. – Vector JX Feb 07 '20 at 18:02
  • I assume this is when you attempt `copy_to`. It appears that you do not have permission to create new tables in the MySQL database. So the second approach I suggested will not work in your configuration – Simon.S.A. Feb 07 '20 at 22:53

Maybe just a try...

Following an above comment, there might be a size limit on the MySQL IN (...) condition. Maybe you could bypass it by splitting the entire list of dataframe$IDs into sublists and re-writing your query with a condition like:

WHERE c.ID IN sublist#1
OR c.ID IN sublist#2
OR c.ID IN sublist#3
...

instead of a single c.ID IN list?

Let's say that we make sublists with a max length of 1000, it could give :

sublists <- split(dataframe$ID, ceiling(seq_along(dataframe$ID)/1000))

and then you could build a string like `c.ID IN (...) OR c.ID IN (...) OR c.ID IN (...)` ...
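For instance, on the five example IDs with sublists of length 2 (pure string-building, no database needed), the OR-chain could be sketched as:

```r
ids <- c("IK-1", "IK-2", "IK-3", "IK-4", "IK-5")

# split into sublists of at most 2 elements each
sublists <- split(ids, ceiling(seq_along(ids) / 2))

# one "c.ID IN (...)" clause per sublist, each value single-quoted
clauses <- sapply(sublists, function(x) {
  paste0("c.ID IN (", paste(shQuote(x, type = "sh"), collapse = ", "), ")")
})
where_part <- paste(clauses, collapse = " OR ")
cat(where_part)
# c.ID IN ('IK-1', 'IK-2') OR c.ID IN ('IK-3', 'IK-4') OR c.ID IN ('IK-5')
```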

Inserted in your code, that would give :

library(RMySQL)
conn<- connection
sublists <- split(dataframe$ID, ceiling(seq_along(dataframe$ID)/1000))

query <- paste0("SELECT c.ID, e.Parameters, d.status
FROM Table1 c
left outer join Table2 d ON d.seq_id=c.ID
LEFT outer JOIN Table3 e ON e.role_id=d.role
           where (",
       paste(lapply(sublists, 
                    FUN = function(x){
                      paste0("c.ID IN (", paste(shQuote(x, type = "sh"), collapse = ', '), ")")
                    }), 
             collapse = "\nOR "), ")
and e.Parameters in
           ('Section1',
           'Section2','Section3',
           'Section4');")

cat(query) # inspect the generated query

res1 <- dbGetQuery(conn,query)

res2<-res1[res1$Parameters=="Section1",4:5]
colnames(res2)[colnames(res2)=="status"] <- "Section1_Status"
Jrm_FRL