13

I'm using dplyr (0.7.0), dbplyr (1.0.0), DBI 0.6-1, and odbc (1.0.1.9000). I would like to do something like the following:

db1 <- DBI::dbConnect(
  odbc::odbc(),
  Driver = "SQL Server",
  Server = "MyServer",
  Database = "DB1"
)
db2 <- DBI::dbConnect(
  odbc::odbc(),
  Driver = "SQL Server",
  Server = "MyServer",
  Database = "DB2"
)
x <- tbl(db1, "Table1") %>%
  dplyr::left_join(tbl(db2, "Table2"), by = "JoinColumn") 

but I keep getting an error that doesn't really seem to have any substance to it. When I use show_query it seems like the code is trying to create a SQL query that joins the two tables without taking the separate databases into account. Per the documentation for dplyr::left_join I've also tried:

x <- tbl(db1, "Table1") %>%
      dplyr::left_join(tbl(db2, "Table2"), by = "JoinColumn", copy = TRUE) 

But there is no change in the output or error message. Is there a different way to join tables from separate databases on the same server?

moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
stat_student
  • 787
  • 10
  • 17
  • 1
    You can use `dbplyr::collect()` to force/pull the data from each db and then do the joining in the local R instance. This suggests you should do your filtering and basic selection before the `collect()` calls in order to not over-pull data. – r2evans Jun 22 '17 at 15:49
  • 2
    My backup plan is to pull in both sets of data and then do the joining locally in R. The problem is that sometimes the data sets are very large and I want to filter the results using a column from a separate data base before pulling in the data locally. – stat_student Jun 22 '17 at 16:30
  • Frankly, doing multi-DBMS join/filtering like this is hard in any language (other than native SQL with foreign keys, etc). If you provide representative sample data from each table (i.e., minimal but sufficient variability to represent), we might be able to help, but I think you're going to be stuck pulling a lot of data from at least one of the DBMSs. – r2evans Jun 22 '17 at 17:27

3 Answers3

16

I'm assuming from the code you provided that (a) you're interested in joining the two tbl objects via dplyr's syntax before you run collect() and pull the results into local memory and that (b) you want to refer directly to the database objects in the call to tbl().

These choices are important if you want to leverage dplyr to programmatically build your query logic while simultaneously leveraging the database server to INNER JOIN large volumes of data down to the set that you're interested in. (Or at least that's why I ended up here.)

The solution I found uses one connection without specifying the database, and spells out the database and schema information using in_schema() (I couldn't find this documented or vignetted anywhere):

conn <- DBI::dbConnect(
  odbc::odbc(),
  Driver = "SQL Server",
  Server = "MyServer"
)

x <- tbl(src_dbi(conn),
         in_schema("DB1.dbo", "Table1")) %>%
  dplyr::left_join(tbl(src_dbi(conn),
                       in_schema("DB1.dbo", "Table2")),
                   by = "JoinColumn")
SamyIshak
  • 411
  • 5
  • 9
2

I faced the same problem and I wasn't able to solve it with dplyr::left_join.

At least I was able to do the job using the following workaround. I connected to SQL Server without declaring a default database, then I ran the query with sql().

con <- dbConnect(odbc::odbc(), dsn="DWH" ,  uid="", pwd= "" )

data_db <- tbl( con, sql("SELECT * 
                    FROM DB1..Table1 AS a
                    LEFT JOIN DB2..Table2 AS b ON a.JoinColumn = b.JoinColumn") ) 

data_db %>% ...

Hope it helps.

Farhad
  • 4,119
  • 8
  • 43
  • 66
zanocom
  • 21
  • 2
-3

I would use the merge() function to perform the left join the on the tables. It would be something like x <- merge(df1, df2, by = "JoinColumn", all.x = TRUE).

  • 5
    I don't believe `merge()` is enabled to work with `dbplyr` and database connections in the same way that `dplyr` functions like `left_join()` are. – stat_student Jun 22 '17 at 16:31