
I am trying to load data from several Teradata SQL tables that are "joined together", using an RODBC connection. There are unique IDs that I want to specify so the query only pulls data for those IDs. The SQL statement currently works with the `sqlQuery` R function when I insert a few of the IDs directly into the query text. However, I have many more IDs (tens or hundreds of thousands) that I want to pull data for. I only have these IDs stored in my local R environment, so I cannot join on them from a SQL table. My question is: how can I pull the data for these IDs by referencing the R variable that holds them?

I have attempted to provide an example that illustrates the problem. The example shows a working case that just "pastes" 10 IDs directly into the SQL statement. However, this option does not work when there are many more IDs (tens or hundreds of thousands, etc.). Unfortunately I don't have write permission on the Teradata DB, so I cannot upload the IDs and join on them there. I only have the IDs from a .csv file, SAS dataset, etc. I also can't just place all the IDs in the SQL statement as in this example, because the query string becomes far too long with all those IDs. I have seen how others use SAS SQL statements that "reference" IDs held in the local SAS environment (using the `&` macro syntax), so I hope I can do something similar in R. Thank you.

#for loading data
library(RODBC)

# Connect to Teradata SQL server
TeradataConnect <- odbcConnect(dsn="****")

# random sample of 10 IDs, but there are many more (e.g., 100,000's)
IDs = c(855820852, 1461838083,  861608839, 1498698119,  817883914,    776309362,
        799069685,  825438654,  856801569, 1749317792)

# specify the date range you want to pull data over
LowDate = Sys.Date() - 365
UpDate = Sys.Date()

#SQL query for pulling data 
SQL_query = paste0("SELECT DISTINCT S.ID
               ,S.LOC_ID
               ,S.DATE
               ,I.ITEM_DESC

               FROM  table.DATA  S
               INNER JOIN item.DATA I
               ON  S.ID = I.ID

               WHERE  1 = 1
               AND I.ITEM_TYPE = 'NORMAL'
               AND S.DATE BETWEEN '", LowDate, "' AND '", UpDate, "'
               AND S.ID IN (", paste(IDs, collapse = ", "), ")
               GROUP BY
               S.ID
               ,S.LOC_ID
               ,S.DATE
               ,I.ITEM_DESC
               ")

#pulls all data from a random sample of IDs
Dataset = sqlQuery(TeradataConnect, SQL_query)

Update: Thanks @RYoda for your suggestions. I am adding an example of the SAS code so you can see how it works. I am not a SAS expert, but I received these queries from someone who is. My understanding is that whenever they use a "&" character in the SQL statement (e.g., "&ids." where ids is a macro variable in the SAS environment), it substitutes the data held in the SAS environment. Unfortunately I don't have write access to the SQL database, so I can't let the database do the work. It looks like I will have to use a workaround like you suggested (a rough sketch of the chunked approach is included after the SAS example below), because the prefiltered data is much too large to load into my R environment and then join with data.table or dplyr. I was hoping to find a more elegant solution, since I often want to use R environment variables like this to filter data pulled from SQL tables. I expect the final filtered data to be several million rows and ~10 columns. Thanks again for your help.

PROC SQL;
 CONNECT TO teradata(db=VIEWS user=&td_user. password=&td_pwd. tdpid=tableid mode=teradata);
 CREATE TABLE Str_inv AS
 SELECT * FROM connection to Teradata
                ( 
                SELECT 
                    nbr1,
                    nbr2,
                    b.item_nbr,
                    sum(inv.qty) AS QTY
                FROM 
                    table.inventory inv
                INNER JOIN
                    table.hier b
                ON
                    inv.item_id=b.item_id
                INNER JOIN 
                    table.loc c
                ON
                    inv.str_loc_id = c.loc_id
                WHERE 
                    inv.cal_dt = &fw_end_dt. AND
                    b.ITEM_TYP_DESC IN ('NORMAL') AND
                    b.CORE_ID = 1 AND
                    b.item_id IN (&ids.)
                GROUP BY
                    1,2,3
                );
QUIT;
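
A minimal sketch of the chunked workaround discussed in the comments below, assuming the same connection and query structure as in my example above (the chunk size of 100 and the helper name fetch_chunk are illustrative only, not tested against the real tables):

# Sketch only: split the ID vector into blocks, run the query once per block,
# and stack the results client-side; 100 IDs per block is arbitrary and just
# needs to keep each query string short enough for the driver
chunk_size <- 100
id_chunks  <- split(IDs, ceiling(seq_along(IDs) / chunk_size))

fetch_chunk <- function(id_block) {
  qry <- paste0("SELECT DISTINCT S.ID, S.LOC_ID, S.DATE, I.ITEM_DESC
                 FROM table.DATA S
                 INNER JOIN item.DATA I ON S.ID = I.ID
                 WHERE I.ITEM_TYPE = 'NORMAL'
                 AND S.DATE BETWEEN '", LowDate, "' AND '", UpDate, "'
                 AND S.ID IN (", paste(id_block, collapse = ", "), ")
                 GROUP BY S.ID, S.LOC_ID, S.DATE, I.ITEM_DESC")
  sqlQuery(TeradataConnect, qry)
}

# Run each block and combine; the ID blocks are disjoint, so a plain row-bind
# is enough and no re-aggregation is needed afterwards
Dataset <- do.call(rbind, lapply(id_chunks, fetch_chunk))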
  • you can specify the id list in SQL using `IN ()` – HubertL Jan 09 '17 at 21:08
  • 1
  • Where is this SQL statement? Consider writing the R dataframe as a temp table in the Teradata DB (if you have permission), then join by IDs. – Parfait Jan 10 '17 at 01:13
  • Just because I am curious: Do you have a code snippet for *"SAS SQL statements that "reference" the IDs"*? – R Yoda Jan 11 '17 at 07:06
  • If the number of IDs used to filter in the WHERE clause is too big, you have only two choices: load all the prefiltered data into R and implement the JOIN and GROUP BY logic (e.g. using the package `data.table`), or bring the IDs into a (temp) table in the database as @Parfait suggested. Letting the database do the work is much more efficient (performance) since you do not have to transfer all the data to the R client. A workaround could be to loop over groups of IDs (e.g. blocks of 100 IDs), construct the `in` clause, fetch the data and finally group all the fetched data in the R client. – R Yoda Jan 11 '17 at 07:13
  • Can you please specify the amount of input data in the database, the expected size of the data after filtering using the IDs, and the expected size after the final processing (group by)? This could give us an estimate of whether client-side processing is a realistic scenario at all... – R Yoda Jan 11 '17 at 07:13
  • Possible duplicate of [RODBC using Data.Frame in a Join on sqlQuery()](http://stackoverflow.com/questions/29990387/rodbc-using-data-frame-in-a-join-on-sqlquery) – R Yoda Jan 11 '17 at 13:06
  • See the update above in the question. It looks like I will have to try the workaround by breaking it up into many chunks. Thanks! – Kevin Jan 11 '17 at 13:35
  • @Kevin Thx for the SAS example (it looks like it just has a short syntax for listing `IN` values from a local vector variable)! I think the limiting factor for the `in` clause is the 4096-character limit on SQL query strings in `RODBC`; the SQL DB could handle longer SQL queries (even though `in` is not very efficient). Perhaps the `RJDBC` package would allow longer strings. Use the package `data.table` and the corresponding syntax to store and aggregate your data, otherwise your RAM and CPU will "die" (work forever). – R Yoda Jan 11 '17 at 16:28
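
Following up on the data.table suggestion in the last comment, a minimal sketch of how the chunked results could be stacked and further aggregated client-side (column names follow my query above; the summary itself is purely illustrative):

library(data.table)

# Stack the per-chunk results efficiently with rbindlist()
chunk_results <- lapply(id_chunks, fetch_chunk)   # from the chunked sketch above
DT <- rbindlist(chunk_results)

# Illustrative example of further client-side aggregation with data.table:
# count the distinct item descriptions seen per ID and location
summary_DT <- DT[, .(n_items = uniqueN(ITEM_DESC)), by = .(ID, LOC_ID)]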
