0

I am trying to left_join two datasets and minimize duplicates from the join. The purpose of joining the data is to match information from df_2 that relates to coordinates of a postcode for each a buyer and a seller in df_1.

 At the moment I am performing the join one after the other for the buyer and seller, but it just leads to duplicates. I have looked into perhaps a conditional join where postcode and suburb are the same in each row of df_1, but not quite sure if this is the right approach.

 An example dataset is below:

df_1 <- data.frame(buyer = c('Sarah', 'John', 'Sam', 'Sally'), Seller = c('B corp', 'M Ltd', 'S and co', 'M Ltd'), purchase = c('hat', 'shirt', 'ball', 'shoe'), buyer_suburb = c('Sandy', 'Rocky', 'Leafy', 'Sandy'), buyer_postcode = c('001', '002', '003', '008'), seller_suburb = c('Sandy', 'Leafy', 'Ocean', 'Leafy'), seller_postcode = c('001', '003', '004', '009'))

df_2 <- data.frame(suburbs = c('Sandy', 'Rocky', 'Leafy', 'Ocean', 'Sandy', 'Leafy'), postcode = c('001', '002', '003', '004', '008', '009'), coordinates = c('0.01, 2.00', '0.02, 3.00', '0.03, 4.00', '0.02, 5.00', '0.02, 8.00', '0.02, 9.00'))

join_df_1 <-  left_join(df_1, df_2,  by=c("buyer_suburb" = "suburbs"))
join_df <-  left_join(join_df_1, df_2,  by=c("seller_suburb" = "suburbs"))
lenlen
  • 79
  • 7
  • 1
    This is way too vague. Be specific enough that code could be written. Preferabley with a [mre]. Show relevant parts you can do. PS There is a common error where people want some joins, each possibly involving a different key, of some subqueries, each possibly involving join and/or aggregation, but they erroneously try to do all the joining then all the aggregating or to aggregate over previous aggregations. Write separate aggregate subqueries over appropriate rows and/or aggregate (maybe distinct) case statements picking rows of (possibly join result) tables; then join the subqueries on keys. – philipxy Jun 17 '22 at 10:16
  • This will be a faq. But one must pin down via a [mre] & then write many clear, concise & precise phrasings of one's question/problem/goal to (re)search. Eg [Two SQL LEFT JOINS produce incorrect result](https://stackoverflow.com/q/12464037/3404097) etc etc. – philipxy Jun 17 '22 at 10:19
  • What you say re the result you want is still vague. "relates" means nothing in particular. Also you don't say why you give that code or why you don't think it is OK or not OK. Please either ask about bad code with the obligatory [mre] & why you think it should return something else at the 1st subexpression that it doesn't give what you expect justified by reference to authoritative documentation or ask about your overall goal. But please ask about the former 1st because misconceptions will get in the way of understanding the latter. And bad code doesn't tell us what you wish it would do. – philipxy Jun 17 '22 at 11:29
  • Please format code reasonably. Someone formatted your 1-liners but your update removed that code & put new one-liners in. Please edit by clicking on 'edit'/'edited' before you update so you see the most recent version since someone might have edited. [ask] [Help] – philipxy Jun 17 '22 at 12:01

1 Answers1

2

You could do this and explicitly rename the coordinates to buyer_coord and seller_coord.

library(tidyverse)

# Data
df_1 <- data.frame(
  buyer = c('Sarah', 'John', 'Sam'), 
  Seller = c('B corp', 'M Ltd', 'S and co'), 
  purchase = c('hat', 'shirt', 'ball'), 
  buyer_suburb = c('Sandy', 'Rocky', 'Leafy'), 
  buyer_postcode = c('001', '002', '003'), 
  seller_suburb = c('Sandy', 'Leafy', 'Ocean'), 
  seller_postcode = c('001', '003', '004'))

df_2 <- data.frame(
  suburbs = c('Sandy', 'Rocky', 'Leafy', 'Ocean'), 
  postcode = c('001', '002', '003', '004'), 
  coordinates = c('0.01, 2.00', '0.02, 3.00', '0.03, 4.00', '0.02, 5.00'))

# Proposed approach
df_1 |> 
  left_join(df_2, by = c("buyer_suburb" = "suburbs", "buyer_postcode" = "postcode")) |> 
  rename(buyer_coord  = coordinates) |> 
  left_join(df_2, by = c("seller_suburb" = "suburbs", "seller_postcode" = "postcode")) |> 
  rename(seller_coord  = coordinates)

#>   buyer   Seller purchase buyer_suburb buyer_postcode seller_suburb
#> 1 Sarah   B corp      hat        Sandy            001         Sandy
#> 2  John    M Ltd    shirt        Rocky            002         Leafy
#> 3   Sam S and co     ball        Leafy            003         Ocean
#>   seller_postcode buyer_coord seller_coord
#> 1             001  0.01, 2.00   0.01, 2.00
#> 2             003  0.02, 3.00   0.03, 4.00
#> 3             004  0.03, 4.00   0.02, 5.00

Created on 2022-06-17 by the reprex package (v2.0.1)

Carl
  • 4,232
  • 2
  • 12
  • 24