The transactional data I am dealing with comes from two sources, each with its own pros and cons. The first one has only the accurate $ and units sold (DF_Lookup), and the second one has the correct demographics (DFI), but some of its $ and units sold are incorrect. So I wrote the following code to handle this.
Here are my data:
DFI
dput(DFI)
structure(list(PO_ID = c("P1234", "P1234", "P1234", "P1234",
"P1234", "P1234", "P2345", "P2345", "P3456", "P4567"), SO_ID = c("S1",
"S1", "S1", "S2", "S2", "S2", "S3", "S4", "S7", "S10"), F_Year = c(2012,
2012, 2012, 2013, 2013, 2013, 2011, 2011, 2014, 2015), Product_ID = c("385X",
"385X", "385X", "450X", "450X", "900X", "3700", "3700", "A11U",
"2700"), Revenue = c(1, 2, 3, 34, 34, 6, 7, 88, 9, 100), Quantity = c(1,
2, 3, 8, 8, 6, 7, 8, 9, 40), Location1 = c("MA", "NY", "WA",
"NY", "WA", "NY", "IL", "IL", "MN", "CA")), .Names = c("PO_ID",
"SO_ID", "F_Year", "Product_ID", "Revenue", "Quantity", "Location1"
), row.names = c(NA, 10L), class = "data.frame")
DF_Lookup
dput(DF_Lookup)
structure(list(PO_ID = c("P1234", "P1234", "P1234", "P2345",
"P2345", "P3456", "P4567"), SO_ID = c("S1", "S2", "S2", "S3",
"S4", "S7", "S10"), F_Year = c(2012, 2013, 2013, 2011, 2011,
2014, 2015), Product_ID = c("385X", "450X", "900X", "3700", "3700",
"A11U", "2700"), Revenue = c(50, 70, 35, 100, -50, 50, 100),
Quantity = c(3, 20, 20, 20, -10, 20, 40)), .Names = c("PO_ID",
"SO_ID", "F_Year", "Product_ID", "Revenue", "Quantity"), row.names = c(NA,
7L), class = "data.frame")
First Try:
Strategy: use a join to overwrite the entries in DFI with the values from DF_Lookup, splitting each lookup total evenly across the rows that share the same key.
library(dplyr)

DF_Generated <- DFI %>%
  left_join(DF_Lookup, by = c("PO_ID", "SO_ID", "F_Year", "Product_ID")) %>%
  dplyr::group_by(PO_ID, SO_ID, F_Year, Product_ID) %>%
  # Count the duplicate rows per key so each lookup total can be split evenly
  dplyr::mutate(Count = n()) %>%
  dplyr::ungroup() %>%
  # Revenue.y / Quantity.y come from DF_Lookup; divide them across the duplicates
  dplyr::mutate(Revenue = Revenue.y / Count, Quantity = Quantity.y / Count) %>%
  dplyr::select(PO_ID:Product_ID, Location1, Revenue, Quantity)
Expected Output:
dput(DF_Generated)
structure(list(PO_ID = c("P1234", "P1234", "P1234", "P1234",
"P1234", "P1234", "P2345", "P2345", "P3456", "P4567"), SO_ID = c("S1",
"S1", "S1", "S2", "S2", "S2", "S3", "S4", "S7", "S10"), F_Year = c(2012,
2012, 2012, 2013, 2013, 2013, 2011, 2011, 2014, 2015), Product_ID = c("385X",
"385X", "385X", "450X", "450X", "900X", "3700", "3700", "A11U",
"2700"), Location1 = c("MA", "NY", "WA", "NY", "WA", "NY", "IL",
"IL", "MN", "CA"), Revenue = c(16.6666666666667, 16.6666666666667,
16.6666666666667, 35, 35, 35, 100, -50, 50, 100), Quantity = c(1,
1, 1, 10, 10, 20, 20, -10, 20, 40)), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -10L), .Names = c("PO_ID", "SO_ID",
"F_Year", "Product_ID", "Location1", "Revenue", "Quantity"))
Challenge: This works well for smaller datasets, but the original data I am dealing with has ~90M records, so the above code takes forever.
Second Try: Hence, I thought of updating only those rows whose $ and units fall outside the +/-10% range.
Here's my code:
# Find out whether the aggregated numbers are within the +/-10% tolerance.
DF_Mod <- DFI %>%
  dplyr::group_by(PO_ID, SO_ID, F_Year, Product_ID) %>%
  dplyr::summarise(Rev_agg = sum(Revenue), Qty_agg = sum(Quantity)) %>%
  left_join(DF_Lookup, by = c("PO_ID", "SO_ID", "F_Year", "Product_ID")) %>%
  dplyr::ungroup() %>%
  # Check against the +/-10% tolerance; the comparison is vectorised, so no
  # rowwise() is needed. abs() in the denominator keeps negative lookup
  # values (like the -50 row) from passing the check spuriously.
  dplyr::mutate(Compute = ifelse((abs(Rev_agg - Revenue) / abs(Revenue) <= 0.1) &
                                   (abs(Qty_agg - Quantity) / abs(Quantity) <= 0.1),
                                 "N", "Y")) %>%
  dplyr::select(PO_ID:Product_ID, Compute) %>%
  dplyr::right_join(DFI, by = c("PO_ID", "SO_ID", "F_Year", "Product_ID"))
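As a quick check on the toy data, only the P4567 group falls inside the tolerance (for example, S1 aggregates to 6 against a lookup value of 50, well outside 10%):

table(DF_Mod$Compute)
#> N Y
#> 1 9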
#Now, filter Compute == "Y" and then do the join with DF_Lookup.
DF_Generated_2 <- DF_Mod %>%
  dplyr::filter(Compute == "Y") %>%
  left_join(DF_Lookup, by = c("PO_ID", "SO_ID", "F_Year", "Product_ID")) %>%
  dplyr::group_by(PO_ID, SO_ID, F_Year, Product_ID) %>%
  dplyr::mutate(Count = n()) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(Revenue = Revenue.y / Count, Quantity = Quantity.y / Count) %>%
  dplyr::select(PO_ID:Product_ID, Location1, Revenue, Quantity)
#Bind the recomputed rows with the DFI rows that were already within the tolerance
DF_Final <- rbind(DF_Generated_2, DFI[DF_Mod$Compute == "N", ]) #Expected output
Here, DF_Final is indeed the expected output.
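As an aside, the positional indexing DFI[DF_Mod$Compute == "N", ] assumes that DF_Mod's rows line up with DFI's after the right_join. A key-based filter avoids that assumption; a minimal sketch with the same dplyr verbs (within_tol is just a name I am introducing here):

# Keep the DFI rows whose key group was flagged "N", joining on the keys
# instead of relying on row order:
within_tol <- DF_Mod %>%
  dplyr::filter(Compute == "N") %>%
  dplyr::distinct(PO_ID, SO_ID, F_Year, Product_ID)
DF_Final <- dplyr::bind_rows(DF_Generated_2,
                             dplyr::semi_join(DFI, within_tol,
                                              by = c("PO_ID", "SO_ID", "F_Year", "Product_ID")))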
Question: Even after following the above approach, the performance is still very slow because of the number of joins involved. Is there any way to speed up this process? Are there any better methods for what I am trying to do?
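One direction I have been considering, but have not benchmarked at ~90M rows, is a data.table update-by-reference join in place of the first approach; a minimal sketch (assuming the data.table package is available):

library(data.table)
setDT(DFI)
setDT(DF_Lookup)
keys <- c("PO_ID", "SO_ID", "F_Year", "Product_ID")
# Count the duplicate rows per key, then overwrite Revenue/Quantity in place,
# splitting each lookup total evenly across those duplicates.
DFI[, Count := .N, by = keys]
DFI[DF_Lookup, `:=`(Revenue = i.Revenue / Count,
                    Quantity = i.Quantity / Count), on = keys]
DFI[, Count := NULL]

Since := updates DFI in place, the intermediate copies created by the dplyr pipeline should be avoided, but I am not sure whether this is the idiomatic way to do it.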
I'd appreciate your thoughts. I have spent a day on this and am still stuck.