
I have a question that I have an answer to, but I think my answer is inefficient and so want a better way. I am querying a database using dbplyr and joining tables based on a few ID columns.

For example, I have one table (Table_1) in the database with an org_ID and a sender_ID: the first is the number for an organization in a given geography and the second is the number for an organization with which it interacted in that geography. So there is org_ID = 1 in geography 2, org_ID = 1 in geography 3, and so forth.

I also have a single identification table (Table_2) with an org_ID column and a corresponding geography. There is no sender_ID in Table_2, but the sender_ID in Table_1 does match org_ID/geography pairs in Table_2.
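For illustration, something like the following toy data captures the structure (the values, and the org_name column standing in for whatever identifying information Table_2 carries, are made up; the real tables live in the database):

library(dplyr)

# Hypothetical stand-ins for the database tables (all values invented for illustration)
Table_1 <- tibble::tribble(
  ~org_ID, ~sender_ID, ~geography,
  1,       2,          2,
  1,       3,          3,
  2,       1,          2
)

Table_2 <- tibble::tribble(
  ~org_ID, ~geography, ~org_name,      # org_name is a made-up attribute column
  1,       2,          "Org A (geo 2)",
  1,       3,          "Org A (geo 3)",
  2,       2,          "Org B (geo 2)",
  3,       3,          "Org C (geo 3)"
)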

I want to combine these tables, but I effectively need to use Table_2 twice. What works, but is slow, is the following:

df <- Table_1 %>%
  left_join(Table_2, by = c("org_ID" = "org_ID", "geography" = "geography")) %>%
  left_join(Table_2, by = c("sender_ID" = "org_ID", "geography" = "geography")) %>%
  # various grouping and summary commands %>%
  collect()

Any ideas for better ways?

EHL
    I understand the complexity and potential confusion/munging of data introduced by having to merge the same data twice. I think it would be really helpful to have some sample data to reflect what you have to start with and, just as importantly, what you are hoping to achieve by it. "Reproducible sample data" (https://stackoverflow.com/q/5963269) is hard to do well, but getting it at least close and adding "expected output" do a LOT in helping us to understand the problem and suggest solutions. – r2evans Sep 09 '22 at 00:28
  • Agree with r2evans. Needs details, specifically something to work with. Speaking for the "answering community" ... Not our responsibility. – IRTFM Sep 09 '22 at 01:33

1 Answer


If it is mainly a question of how to do:

df <- Table_1 %>%
  left_join(Table_2, by = c("org_ID" = "org_ID", "geography" = "geography")) %>%
  left_join(Table_2, by = c("sender_ID" = "org_ID", "geography" = "geography")) %>%
  # various grouping and summary commands %>%
  collect()

more effectively (with only a single join), then I would suggest:

df <- Table_1 %>%
  left_join(Table_2, by = "geography", suffix = c(".x", ".y")) %>%
  filter(org_ID.x == org_ID.y | sender_ID == org_ID.y) %>%
  # various grouping and summary commands %>%
  collect()

However, this imposes certain handling of records in Table_1 that do not appear in Table_2: it behaves more like an inner join than a left join.
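If you want to compare what each version asks the database to do, you can inspect the SQL that dbplyr generates before collecting. This is only a sketch: con, "table_1", and "table_2" are placeholders for your own connection and table names.

library(dplyr)

# Lazy references to the remote tables ("con", "table_1", "table_2" are placeholders)
Table_1 <- tbl(con, "table_1")
Table_2 <- tbl(con, "table_2")

# SQL generated for the double-join version
Table_1 %>%
  left_join(Table_2, by = c("org_ID", "geography")) %>%
  left_join(Table_2, by = c("sender_ID" = "org_ID", "geography" = "geography")) %>%
  show_query()

# SQL generated for the single-join-plus-filter version
Table_1 %>%
  left_join(Table_2, by = "geography", suffix = c(".x", ".y")) %>%
  filter(org_ID.x == org_ID.y | sender_ID == org_ID.y) %>%
  show_query()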

Regarding performance:

  • Joins are faster when the underlying tables are indexed. Make sure the database has indexes on the columns you are joining on (see the sketch after this list).
  • Even efficient commands/SQL will have minimal impact on speed if the connection between R and the database is slow. collect() copies the data from the database into R; this may be the slowest point of your code.
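As a rough sketch of the first point: if you have write access, indexes can be created directly through the DBI connection. The index and table names below are assumptions based on the columns described in the question, and the exact CREATE INDEX syntax may vary by database backend.

library(DBI)

# Assumes "con" is your existing DBI connection and the remote tables are
# named "table_1" and "table_2"; adjust names to match your database.
dbExecute(con, "CREATE INDEX idx_table_1_org_geo    ON table_1 (org_ID, geography)")
dbExecute(con, "CREATE INDEX idx_table_1_sender_geo ON table_1 (sender_ID, geography)")
dbExecute(con, "CREATE INDEX idx_table_2_org_geo    ON table_2 (org_ID, geography)")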
Simon.S.A.