The transactional data I am dealing with comes from two sources, each with its own pros and cons. The first one has only the accurate $ and units sold (DF_Lookup), and the second one has the correct demographics (DFI), but some of its $ and units sold are incorrect. So I wrote the following code to handle this.
Here are my data:
DFI
dput(DFI)
structure(list(PO_ID = c("P1234", "P1234", "P1234", "P1234",
"P1234", "P1234", "P2345", "P2345", "P3456", "P4567"), SO_ID = c("S1",
"S1", "S1", "S2", "S2", "S2", "S3", "S4", "S7", "S10"), F_Year = c(2012,
2012, 2012, 2013, 2013, 2013, 2011, 2011, 2014, 2015), Product_ID = c("385X",
"385X", "385X", "450X", "450X", "900X", "3700", "3700", "A11U",
"2700"), Revenue = c(1, 2, 3, 34, 34, 6, 7, 88, 9, 100), Quantity = c(1,
2, 3, 8, 8, 6, 7, 8, 9, 40), Location1 = c("MA", "NY", "WA",
"NY", "WA", "NY", "IL", "IL", "MN", "CA")), .Names = c("PO_ID",
"SO_ID", "F_Year", "Product_ID", "Revenue", "Quantity", "Location1"
), row.names = c(NA, 10L), class = "data.frame")
DF_Lookup
dput(DF_Lookup)
structure(list(PO_ID = c("P1234", "P1234", "P1234", "P2345",
"P2345", "P3456", "P4567"), SO_ID = c("S1", "S2", "S2", "S3",
"S4", "S7", "S10"), F_Year = c(2012, 2013, 2013, 2011, 2011,
2014, 2015), Product_ID = c("385X", "450X", "900X", "3700", "3700",
"A11U", "2700"), Revenue = c(50, 70, 35, 100, -50, 50, 100),
Quantity = c(3, 20, 20, 20, -10, 20, 40)), .Names = c("PO_ID",
"SO_ID", "F_Year", "Product_ID", "Revenue", "Quantity"), row.names = c(NA,
7L), class = "data.frame")
First Try:
Strategy: use a join to overwrite the entries in DFI with the values from DF_Lookup, splitting each lookup total evenly across the rows that share the same key.
library(dplyr)

DF_Generated <- DFI %>%
  left_join(DF_Lookup, by = c("PO_ID", "SO_ID", "F_Year", "Product_ID")) %>%
  dplyr::group_by(PO_ID, SO_ID, F_Year, Product_ID) %>%
  # Count the duplicate rows per key so each lookup total can be split evenly
  dplyr::mutate(Count = n()) %>%
  dplyr::ungroup() %>%
  # Revenue.y / Quantity.y come from DF_Lookup; divide them across the duplicates
  dplyr::mutate(Revenue = Revenue.y / Count, Quantity = Quantity.y / Count) %>%
  dplyr::select(PO_ID:Product_ID, Location1, Revenue, Quantity)
Expected Output:
dput(DF_Generated)
structure(list(PO_ID = c("P1234", "P1234", "P1234", "P1234",
"P1234", "P1234", "P2345", "P2345", "P3456", "P4567"), SO_ID = c("S1",
"S1", "S1", "S2", "S2", "S2", "S3", "S4", "S7", "S10"), F_Year = c(2012,
2012, 2012, 2013, 2013, 2013, 2011, 2011, 2014, 2015), Product_ID = c("385X",
"385X", "385X", "450X", "450X", "900X", "3700", "3700", "A11U",
"2700"), Location1 = c("MA", "NY", "WA", "NY", "WA", "NY", "IL",
"IL", "MN", "CA"), Revenue = c(16.6666666666667, 16.6666666666667,
16.6666666666667, 35, 35, 35, 100, -50, 50, 100), Quantity = c(1,
1, 1, 10, 10, 20, 20, -10, 20, 40)), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -10L), .Names = c("PO_ID", "SO_ID",
"F_Year", "Product_ID", "Location1", "Revenue", "Quantity"))
Challenge: This works well for smaller datasets, but the original data I am dealing with has ~90M records, so the above code takes forever.
Second Try: Hence, I thought of updating only those rows whose $ and units fall outside the +/-10% range.
Here's my code:
# Find out whether the aggregated numbers are within the +/-10% tolerance.
DF_Mod <- DFI %>%
  dplyr::group_by(PO_ID, SO_ID, F_Year, Product_ID) %>%
  dplyr::summarise(Rev_agg = sum(Revenue), Qty_agg = sum(Quantity)) %>%
  left_join(DF_Lookup, by = c("PO_ID", "SO_ID", "F_Year", "Product_ID")) %>%
  dplyr::ungroup() %>%
  # Check against the +/-10% tolerance; the comparison is vectorised, so no
  # rowwise() is needed. abs() in the denominator keeps negative lookup
  # values (like the -50 row) from passing the check spuriously.
  dplyr::mutate(Compute = ifelse((abs(Rev_agg - Revenue) / abs(Revenue) <= 0.1) &
                                   (abs(Qty_agg - Quantity) / abs(Quantity) <= 0.1),
                                 "N", "Y")) %>%
  dplyr::select(PO_ID:Product_ID, Compute) %>%
  dplyr::right_join(DFI, by = c("PO_ID", "SO_ID", "F_Year", "Product_ID"))
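As a quick check on the toy data, only the P4567 group falls inside the tolerance (for example, S1 aggregates to 6 against a lookup value of 50, well outside 10%):

table(DF_Mod$Compute)
#> N Y
#> 1 9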
#Now, filter Compute == "Y" and then do the join with DF_Lookup.
DF_Generated_2 <- DF_Mod %>%
  dplyr::filter(Compute == "Y") %>%
  left_join(DF_Lookup, by = c("PO_ID", "SO_ID", "F_Year", "Product_ID")) %>%
  dplyr::group_by(PO_ID, SO_ID, F_Year, Product_ID) %>%
  dplyr::mutate(Count = n()) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(Revenue = Revenue.y / Count, Quantity = Quantity.y / Count) %>%
  dplyr::select(PO_ID:Product_ID, Location1, Revenue, Quantity)
#Bind the recomputed rows with the DFI rows that were already within the tolerance
DF_Final <- rbind(DF_Generated_2, DFI[DF_Mod$Compute == "N", ]) #Expected output
Here, DF_Final is indeed the expected output.
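As an aside, the positional indexing DFI[DF_Mod$Compute == "N", ] assumes that DF_Mod's rows line up with DFI's after the right_join. A key-based filter avoids that assumption; a minimal sketch with the same dplyr verbs (within_tol is just a name I am introducing here):

# Keep the DFI rows whose key group was flagged "N", joining on the keys
# instead of relying on row order:
within_tol <- DF_Mod %>%
  dplyr::filter(Compute == "N") %>%
  dplyr::distinct(PO_ID, SO_ID, F_Year, Product_ID)
DF_Final <- dplyr::bind_rows(DF_Generated_2,
                             dplyr::semi_join(DFI, within_tol,
                                              by = c("PO_ID", "SO_ID", "F_Year", "Product_ID")))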
Question: Even after following the above approach, the performance is still very slow because of the number of joins involved. Is there any way to speed up this process? Are there any better methods for what I am trying to do?
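One direction I have been considering, but have not benchmarked at ~90M rows, is a data.table update-by-reference join in place of the first approach; a minimal sketch (assuming the data.table package is available):

library(data.table)
setDT(DFI)
setDT(DF_Lookup)
keys <- c("PO_ID", "SO_ID", "F_Year", "Product_ID")
# Count the duplicate rows per key, then overwrite Revenue/Quantity in place,
# splitting each lookup total evenly across those duplicates.
DFI[, Count := .N, by = keys]
DFI[DF_Lookup, `:=`(Revenue = i.Revenue / Count,
                    Quantity = i.Quantity / Count), on = keys]
DFI[, Count := NULL]

Since := updates DFI in place, the intermediate copies created by the dplyr pipeline should be avoided, but I am not sure whether this is the idiomatic way to do it.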
I'd appreciate your thoughts. I have spent a day on this and am still stuck.