0

I have two datasets and their dput is below:

Scenario: df1 contains data on unitprices for certain dates and times. df2 contains the competitor for each station code in df1 along with prices and competitor brand name. df2 does not have all the unit prices for dates and times in df1. I need to somehow merge df2 in df1 such that I get date-time, Product, ServiceType & Brand but NA in unitprices for competitors so I can impute these values after.

What I am trying to do is as follows:

  1. Find out which date-time unit prices are missing in df2 based on values in df1 conditioned on similar Station Code, Product & Service Type. Each station code in df1 has multiple "stationscode" in df2 as there can be be more than one competitor.

  2. Somehow extract and merge this information so I can impute values after.

Let's suppose df1 is

 date StationCode  ServiceType Product unitprice
1 2017-06-01 06:00:00        1002 Self-Service      GG     1.345
2 2017-06-01 07:00:00        1002       Served      GG     1.456
3 2017-06-01 07:00:00        1002 Self-Service      SP     1.012 
dput(df1) 


structure(list(date = structure(c(1496300400,
     1496304000, 1496304000 ), class = c("POSIXct", "POSIXt"), tzone =
     "Etc/GMT+1"), StationCode = c(1002,  1002, 1002), ServiceType =
     structure(c(1L, 2L, 1L), .Label = c("Self-Service",  "Served"), class
     = "factor"), Product = structure(c(1L, 1L, 2L ), .Label = c("GG", "SP"), class = "factor"), unitprice = c(1.345, 
     1.456, 1.012)), class = "data.frame", row.names = c(NA, -3L))

and df2 is

date compstcode StationCode  ServiceType Product unitprice brand
1 2017-06-01 06:00:00       3456        1002 Self-Service      GG     1.425 Shell
2 2017-06-01 06:00:00       1267        1002       Served      SP     1.406    BP
3 2017-06-01 06:00:00       5488        1002 Self-Service      GG     1.011 Total

dput(df2)
    structure(list(date = structure(c(1496300400, 1496300400, 1496300400
), class = c("POSIXct", "POSIXt"), tzone = "Etc/GMT+1"), compstcode = c(3456, 
1267, 5488), StationCode = c(1002, 1002, 1002), ServiceType = structure(c(1L, 
2L, 1L), .Label = c("Self-Service", "Served"), class = "factor"), 
    Product = structure(c(1L, 2L, 1L), .Label = c("GG", "SP"), class = "factor"), 
    unitprice = c(1.425, 1.406, 1.011), brand = structure(c(2L, 
    1L, 3L), .Label = c("BP", "Shell", "Total"), class = "factor")), class = "data.frame", row.names = c(NA, 
-3L))

I want to extract compstcode, uprice and brand from df2 into df1 such that we have the compstcode and brand not null whereas uprice can be null for all dates and times of df1.

Mako212
  • 6,787
  • 1
  • 18
  • 37
syebill
  • 543
  • 6
  • 23
  • Please see [Creating a Great Reproducible Example in R](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and update your question, with emphasis on the **minimal dataset/example** part. Also your `dput` for `df1` is incomplete. – Mako212 Jan 23 '19 at 18:12
  • 1
    Done as per instruction. – syebill Jan 23 '19 at 18:22

1 Answers1

1

I think you're essentially looking to join these two data frames based on multiple criteria, but let me know if that doesn't capture what you're trying to do.

library(tidyverse)

df1 %>% left_join(., df2 %>% select(-date), by = c("StationCode", "ServiceType", "Product"))

                 date StationCode  ServiceType Product unitprice.x compstcode unitprice.y brand
1 2017-06-01 06:00:00        1002 Self-Service      GG       1.345       3456       1.425 Shell
2 2017-06-01 06:00:00        1002 Self-Service      GG       1.345       5488       1.011 Total
3 2017-06-01 07:00:00        1002       Served      GG       1.456         NA          NA  <NA>
4 2017-06-01 07:00:00        1002 Self-Service      SP       1.012         NA          NA  <NA>

But you don't have to exclude date, it could similarly be (renaming columns for clarity):

df1 %>% left_join(., df2 %>% rename(compDate = date, compunitprice = unitprice), by = c("StationCode", "ServiceType", "Product"))
Mako212
  • 6,787
  • 1
  • 18
  • 37
  • This is useful. What I am looking for is a resulting frame with values for all dates and times in df1 by station code, for all combinations of service type and product for each of the competitor in df2 - the unit prices of competitors. some values will be there other competitor unit prices will be NA. Any ideas please? – syebill Jan 23 '19 at 18:53
  • @syebill Just made a change, let me know if that's what you're looking for. – Mako212 Jan 23 '19 at 18:56
  • @syebill double check your syntax and that you haven't made any changes to `df1/df2` from the versions you provided. This still runs correctly for me, and I double checked that I haven't made any changes to `df1/df2` – Mako212 Jan 23 '19 at 19:14
  • Thank you for your help. I have updated the data as the solution did not work on this amended data set. Can you have a look? Could it be to do with the type of join? – syebill Jan 23 '19 at 19:18
  • @syebill using your updated `df1` the above codes still runs, I've updated my output. – Mako212 Jan 23 '19 at 19:21