Join on 4 variables then group on fewer variables using data.table

Question

This thread is a continuation of my earlier thread Join then mutate using data.table without intermediate table.

In that thread, I am using look-up table to change revenue and quantity and then dividing the result by .N so that when I aggregate the products, I don't see inflated values.

As per recommendation from the expert on that thread, I don't want to count on all the four variables used for join i.e. PO_ID, SO_ID, F_Year, Product_ID but only SO_ID, F_Year, Product_ID.

Question: how can I do this using data.table?

Here are my data and code:

Here are my data and solution using dplyr

Input

DFI = structure(list(PO_ID = c("P1234", "P1234", "P1234", "P1234", 
"P1234", "P1234", "P2345", "P2345", "P3456", "P4567"), SO_ID = c("S1", 
"S1", "S2", "S2", "S2", "S2", "S3", "S3", "S7", "S10"), F_Year = c(2012, 
2012, 2013, 2013, 2013, 2013, 2011, 2011, 2014, 2015), Product_ID = c("385X", 
"385X", "450X", "450X", "450X", "900X", "3700", "3700", "A11U", 
"2700"), Revenue = c(1, 2, 3, 34, 34, 6, 7, 88, 9, 100), Quantity = c(1, 
2, 3, 8, 8, 6, 7, 8, 9, 40), Location1 = c("MA", "NY", "WA", 
"NY", "WA", "NY", "IL", "IL", "MN", "CA")), .Names = c("PO_ID", 
"SO_ID", "F_Year", "Product_ID", "Revenue", "Quantity", "Location1"
), row.names = c(NA, 10L), class = "data.frame")

Look Up Table

DF_Lookup = structure(list(PO_ID = c("P1234", "P1234", "P1234", "P2345", 
"P2345", "P3456", "P4567"), SO_ID = c("S1", "S2", "S2", "S3", 
"S4", "S7", "S10"), F_Year = c(2012, 2013, 2013, 2011, 2011, 
2014, 2015), Product_ID = c("385X", "450X", "900X", "3700", "3700", 
"A11U", "2700"), Revenue = c(50, 70, 35, 100, -50, 50, 100), 
    Quantity = c(3, 20, 20, 20, -10, 20, 40)), .Names = c("PO_ID", 
"SO_ID", "F_Year", "Product_ID", "Revenue", "Quantity"), row.names = c(NA, 
7L), class = "data.frame")

Here's my modified code using dplyr:

DF_Generated <- DFI %>% 
  left_join(DF_Lookup,by = c("PO_ID", "SO_ID", "F_Year", "Product_ID")) %>%
  dplyr::group_by(SO_ID, F_Year, Product_ID) %>%
  dplyr::mutate(Count = n()) %>%
  dplyr::ungroup()%>%
  dplyr::mutate(Revenue = Revenue.y/Count, Quantity = Quantity.y/Count) %>%
  dplyr::select(PO_ID:Product_ID,Location1,Revenue,Quantity)

Please note that input to group_by has changed.

Expected output:

DF_Generated = structure(list(PO_ID = c("P1234", "P1234", "P1234", "P1234", 
"P1234", "P1234", "P2345", "P2345", "P3456", "P4567"), SO_ID = c("S1", 
"S1", "S2", "S2", "S2", "S2", "S3", "S3", "S7", "S10"), F_Year = c(2012, 
2012, 2013, 2013, 2013, 2013, 2011, 2011, 2014, 2015), Product_ID = c("385X", 
"385X", "450X", "450X", "450X", "900X", "3700", "3700", "A11U", 
"2700"), Location1 = c("MA", "NY", "WA", "NY", "WA", "NY", "IL", 
"IL", "MN", "CA"), Revenue = c(25, 25, 23.3333333333333, 23.3333333333333, 
23.3333333333333, 35, 50, 50, 50, 100), Quantity = c(1.5, 1.5, 
6.66666666666667, 6.66666666666667, 6.66666666666667, 20, 10, 
10, 20, 40)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-10L), .Names = c("PO_ID", "SO_ID", "F_Year", "Product_ID", "Location1", 
"Revenue", "Quantity"))

NOTE: Please note that I don't want to create intermediate variable because the actual data size is so large that this may not be feasible.

@mtoto - I tried using `dplyr`, something I have posted above. I am still trying to do it using `data.table` as we speak, and keep getting errors. As I have stated, I am a beginner and not really familiar with `data.table`. — watchtower, Mar 03 '17 at 21:28
Joining on the fourth column doesn't do anything meaningful, I guess. If I just leave it out of the join, I get your desired result: `DFI[DF_Lookup, on=.(SO_ID, F_Year, Product_ID), \`:=\`(nr = i.Revenue/.N, nq = i.Quantity/.N), by=.EACHI][]` — Frank, Mar 03 '17 at 21:30
@Frank - Thanks for your help, but that is dependent on those three columns being a composite key. In my actual data that is not the case. Do you want me to change the lookup table and re-post it? — watchtower, Mar 03 '17 at 21:33
Sure, I'm not a sql guru and so don't know how the compositeness is important here. — Frank, Mar 03 '17 at 21:34
Sure. I think what I am trying to state is that joining on 3 columns would work only if those 3 columns could uniquely identify a row (to prevent duplication of row). In my actual table that is not the case. 4 columns uniquely identify a row, and then I need to use 3 columns to group the columns. — watchtower, Mar 03 '17 at 21:35

Kristofersen · Accepted Answer · 2017-03-03T21:42:24.103

1

This should do what you're looking for

library(data.table)
setDT(DFI)
DFI[ , c("Revenue", "Quantity") := NULL]

setDT(DF_Lookup)

dat = merge(DF_Lookup, DFI, by = c("PO_ID", "SO_ID", "F_Year", "Product_ID"))
dat = dat[ , .(Revenue = Revenue/.N, Quantity = Quantity/.N, Location1), by = .(PO_ID, SO_ID, F_Year, Product_ID)]

dat
    PO_ID SO_ID F_Year Product_ID   Revenue  Quantity Location1
 1: P1234    S1   2012       385X  25.00000  1.500000        MA
 2: P1234    S1   2012       385X  25.00000  1.500000        NY
 3: P1234    S2   2013       450X  23.33333  6.666667        WA
 4: P1234    S2   2013       450X  23.33333  6.666667        NY
 5: P1234    S2   2013       450X  23.33333  6.666667        WA
 6: P1234    S2   2013       900X  35.00000 20.000000        NY
 7: P2345    S3   2011       3700  50.00000 10.000000        IL
 8: P2345    S3   2011       3700  50.00000 10.000000        IL
 9: P3456    S7   2014       A11U  50.00000 20.000000        MN
10: P4567   S10   2015       2700 100.00000 40.000000        CA

edited Mar 03 '17 at 21:42

answered Mar 03 '17 at 21:36

Kristofersen

2,736
1
15
31

Thanks for your help. I believe in your second line you meant to do `by = .(SO_ID, F_Year, Product_ID)` [There is no `PO_ID`]. Right? – watchtower Mar 03 '17 at 21:38
@watchtower i'm not sure what you mean. There is a PO_ID in your expected output, and my output matches yours. – Kristofersen Mar 03 '17 at 21:40
2

`length(some_col)` is `.N` when used there. – Frank Mar 03 '17 at 21:40
@Frank thanks, i should have caught that. I thought he was using unique of location before and i swapped out uniqueN with length... whoops. i'll fix it – Kristofersen Mar 03 '17 at 21:41
@Kristofersen - Is it possible to do what you have done without intermediate table `dat`? Also, to your other point: I meant that the grouping in your second step could should be like: `dat = dat[ , .(Revenue = Revenue.x/.N, Quantity = Quantity.x/.N, Location1), by = .(SO_ID, F_Year, Product_ID)]` i.e. everything you have done without `PO_ID` Please see my `dplyr` code for grouping. – watchtower Mar 03 '17 at 21:45
1

@watchtower Sure, we can just save everything as DF_Lookup. so just write `DF_Lookup = merge(DF_Lookup, ...)` and then replace all of the `dat` with `DF_Lookup` – Kristofersen Mar 03 '17 at 21:46
1

@watchtower if you just leave PO_ID out of the code you will no longer have that column at the end. – Kristofersen Mar 03 '17 at 21:47
Kristofersen - I didn't realize that. You are correct. Is there any way I can group on 3 variables and still have `PO_ID` in the output? – watchtower Mar 03 '17 at 21:49
1

@watchtower you could do this `dat = dat[ , .(PO_ID,Revenue = Revenue/length(Location1), Quantity = Quantity/length(Location1), Location1), by = .(SO_ID, F_Year, Product_ID)]` – Kristofersen Mar 03 '17 at 21:50

Join on 4 variables then group on fewer variables using data.table

1 Answers1

Linked