
I am working with database tables using dbplyr.

I have a local table and want to join it with a large table (150m rows) on the database.

The PRODUCTION database is read-only.

# Set up the connection and point to the table

library(DBI); library(odbc); library(dplyr); library(dbplyr)  # DBI for dbConnect(), dplyr for tbl() and the join verbs

my_conn_string <- paste("Driver={Teradata};DBCName=teradata2690;DATABASE=PRODUCTION;UID=",
                        t2690_username, ";PWD=", t2690_password, sep = "")

t2690 <- dbConnect(odbc::odbc(), .connection_string = my_conn_string)

order_line <- tbl(t2690, "order_line") # 150m rows

I also have a local table; let's call it orders:

# fill df with random data

orders <- data.frame(matrix(rexp(100000 * 5), nrow = 100000, ncol = 5))

names(orders) <- c("customer_id", paste0("variable_", 1:4))

Let's say I want to join these two tables; I get the following error:

complete_orders <- orders %>% left_join(order_line)

> Error: `x` and `y` must share the same src, set `copy` = TRUE (may be slow)

The issue is that if I were to set `copy = TRUE`, it would try to download the whole of order_line, and my computer would quickly run out of memory.

Another option would be to upload the orders table to the database. The issue here is that the PRODUCTION database is read-only, so I would have to upload to a different database. Trying to copy across databases in dbplyr results in the same error.

The only solution I have found is to upload into the writable database and use raw SQL to join them, which is far from ideal.
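
For reference, that workaround looks roughly like this. It is only a sketch: the writable schema name TEMP is a placeholder, and whether the driver accepts DBI::Id() for schema-qualified names is an assumption.

# Write the local data frame into a writable schema on the same server
library(DBI)

dbWriteTable(t2690, Id(schema = "TEMP", table = "orders"), orders, overwrite = TRUE)

# Join in the database with hand-written SQL and pull back only the result
complete_orders <- dbGetQuery(t2690, "
  SELECT o.*, ol.*  -- drop or rename the duplicated customer_id column if needed
  FROM TEMP.orders AS o
  LEFT JOIN PRODUCTION.order_line AS ol
    ON o.customer_id = ol.customer_id
")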

Shinobi_Atobe
  • If you're joining on one variable only, maybe you could use `IN (a, b, c...)`, i.e. put all the necessary data from the local table in the query, then query the big table only. – moodymudskipper May 21 '18 at 20:54
  • 1
    This would work assuming the local table is small enough... but seeing as it is 100k rows, the database has trouble handling a query that big. Also having 20k+ lines of R code is not so much fun – Shinobi_Atobe May 22 '18 at 08:06
  • Indeed, I'm afraid that if the data can't be put in the query it has to be put in the DB. A hybrid way would be to filter the source table enough to be able to load it into R with collect(), using data from your R object in your query, and then finish the job in R. – moodymudskipper May 22 '18 at 08:14
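
For reference, a minimal sketch of the hybrid approach from the comments, assuming the join key is customer_id: dbplyr translates `%in%` against a local vector into an `IN (...)` clause, and chunking the ids keeps each query small enough for the database (the chunk size of 1,000 is an arbitrary choice).

library(dplyr); library(dbplyr)

ids <- unique(orders$customer_id)
chunks <- split(ids, ceiling(seq_along(ids) / 1000))

# Pull back only the order_line rows that match the local ids, one chunk at a time
order_line_subset <- chunks %>%
  lapply(function(id_chunk) {
    order_line %>%
      filter(customer_id %in% id_chunk) %>%  # translated to IN (...) on the database
      collect()
  }) %>%
  bind_rows()

# Finish the join locally
complete_orders <- orders %>% left_join(order_line_subset, by = "customer_id")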

2 Answers


I have found the answer: you can use the in_schema() function within the tbl() call to work across schemas within the same connection.

# Connect without specifying a database
my_conn_string <- paste("Driver={Teradata};DBCName=teradata2690;UID=",
                        t2690_username, ";PWD=", t2690_password, sep = "")

t2690 <- dbConnect(odbc::odbc(), .connection_string = my_conn_string)

# Upload the local table to the TEMP database (see the sketch below), then point to it
orders <- tbl(t2690, in_schema("TEMP", "orders"))

order_line <- tbl(t2690, in_schema("PRODUCTION", "order_line"))

complete_orders <- orders %>% left_join(order_line)
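
The snippet above assumes orders has already been written into the writable TEMP schema, which the answer itself does not show. A minimal sketch of that upload step, assuming write access to TEMP and a dbplyr version whose copy_to() accepts an in_schema() name:

library(DBI); library(dplyr); library(dbplyr)

copy_to(t2690, orders,
        name = in_schema("TEMP", "orders"),
        temporary = FALSE,  # persist it so the tbl() pointer above can see it
        overwrite = TRUE)

# Alternative spelling via DBI if copy_to() rejects the in_schema() name:
# dbWriteTable(t2690, Id(schema = "TEMP", table = "orders"), orders, overwrite = TRUE)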
Shinobi_Atobe
  • I have a similar issue with a read-only database (Redshift), and want to reference a local data frame within my query using `%in%`. However, when I try to connect without specifying `database` (as in your first line) it throws an error due to this omission, preventing me from adding my local data frame to some temporary database to complete ... does this solution work specifically for Teradata or am I doing something incorrect here? – daRknight Dec 04 '18 at 17:32

> Another option would be to upload the orders table to the database. The issue here is that the PRODUCTION database is read-only, so I would have to upload to a different database. Trying to copy across databases in dbplyr results in the same error.

In your use case, it seems (based on the accepted answer) that your databases are on the same server and it's just a matter of using in_schema(). If this were not the case, another approach would be the one given here, which in effect gives a version of copy_to() that works on a read-only connection.
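
For what it's worth, a related option in more recent dbplyr (2.2.0 and later) is copy_inline(), which renders the local data frame as literal values inside the generated SQL, so no write access is needed at all. A minimal sketch, assuming the join key is customer_id and that the local table is small enough to embed in a query:

library(dplyr); library(dbplyr)

# Embed the local data frame in the query itself; nothing is written to the database
orders_remote <- copy_inline(t2690, orders)

complete_orders <- orders_remote %>%
  left_join(order_line, by = "customer_id") %>%
  collect()  # only collect if the joined result fits in local memory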

Ian Gow