Join then mutate using data.table without intermediate table

Question

I am a beginner in data.table and searched around to do join and then mutate columns. I found data.table join then add columns to existing data.frame without re-copy thread, but I was not able to proceed further.

Please note that I am able to what I want to do using dplyr, but it's not feasible to run this code on the actual data because of the size. Plus, for aforementioned reason, I cannot create intermediate tables.

Here are my data and solution using dplyr

Input

DFI = structure(list(PO_ID = c("P1234", "P1234", "P1234", "P1234", 
"P1234", "P1234", "P2345", "P2345", "P3456", "P4567"), SO_ID = c("S1", 
"S1", "S1", "S2", "S2", "S2", "S3", "S4", "S7", "S10"), F_Year = c(2012, 
2012, 2012, 2013, 2013, 2013, 2011, 2011, 2014, 2015), Product_ID = c("385X", 
"385X", "385X", "450X", "450X", "900X", "3700", "3700", "A11U", 
"2700"), Revenue = c(1, 2, 3, 34, 34, 6, 7, 88, 9, 100), Quantity = c(1, 
2, 3, 8, 8, 6, 7, 8, 9, 40), Location1 = c("MA", "NY", "WA", 
"NY", "WA", "NY", "IL", "IL", "MN", "CA")), .Names = c("PO_ID", 
"SO_ID", "F_Year", "Product_ID", "Revenue", "Quantity", "Location1"
), row.names = c(NA, 10L), class = "data.frame")

Look Up Table

DF_Lookup = structure(list(PO_ID = c("P1234", "P1234", "P1234", "P2345", 
"P2345", "P3456", "P4567"), SO_ID = c("S1", "S2", "S2", "S3", 
"S4", "S7", "S10"), F_Year = c(2012, 2013, 2013, 2011, 2011, 
2014, 2015), Product_ID = c("385X", "450X", "900X", "3700", "3700", 
"A11U", "2700"), Revenue = c(50, 70, 35, 100, -50, 50, 100), 
    Quantity = c(3, 20, 20, 20, -10, 20, 40)), .Names = c("PO_ID", 
"SO_ID", "F_Year", "Product_ID", "Revenue", "Quantity"), row.names = c(NA, 
7L), class = "data.frame")

Output

DFO = structure(list(PO_ID = c("P1234", "P1234", "P1234", "P1234", 
"P1234", "P1234", "P2345", "P2345", "P3456", "P4567"), SO_ID = c("S1", 
"S1", "S1", "S2", "S2", "S2", "S3", "S4", "S7", "S10"), F_Year = c(2012, 
2012, 2012, 2013, 2013, 2013, 2011, 2011, 2014, 2015), Product_ID = c("385X", 
"385X", "385X", "450X", "450X", "900X", "3700", "3700", "A11U", 
"2700"), Revenue = c(16.6666666666667, 16.6666666666667, 16.6666666666667, 
35, 35, 35, 100, -50, 50, 100), Quantity = c(1, 1, 1, 10, 10, 
20, 20, -10, 20, 40), Location1 = c("MA", "NY", "WA", "NY", "WA", 
"NY", "IL", "IL", "MN", "CA")), .Names = c("PO_ID", "SO_ID", 
"F_Year", "Product_ID", "Revenue", "Quantity", "Location1"), row.names = c(NA, 
10L), class = "data.frame")

Here's my code using dplyr

I am using two libraries here: dplyr and compare

I am using left join to add new entries from the Look Up table into DFI. Then I am dividing the revenue and column based on the number of rows in a group. This is because I want to prevent inflation of numbers when grouped.

DF_Generated <- DFI %>% 
  dplyr::left_join(DF_Lookup,by = c("PO_ID", "SO_ID", "F_Year", "Product_ID")) %>%
  dplyr::group_by(PO_ID, SO_ID, F_Year, Product_ID) %>%
  dplyr::mutate(Count = n()) %>%
  dplyr::ungroup()%>%
  dplyr::mutate(Revenue = Revenue.y/Count, Quantity = Quantity.y/Count) %>%
  dplyr::select(PO_ID:Product_ID,Location1,Revenue,Quantity)

Here's how the output matches:

compare(DF_Generated,DFO,allowAll = TRUE)
TRUE

I'd sincerely appreciate any help.

Ok. All I am doing is a left join, and then mutating columns in the result. Nevertheless, let me write a few lines. I have added the logic. — watchtower, Mar 03 '17 at 20:27

Frank · Accepted Answer · 2017-03-03T20:48:30.153

3

It's more efficient to simply add columns to DFI (in an "update join"), rather than making a new table:

DFI[DF_Lookup, on=.(PO_ID, SO_ID, F_Year, Product_ID), 
  `:=`(newrev = i.Revenue/.N, newqty = i.Quantity/.N)
, by=.EACHI]

    PO_ID SO_ID F_Year Product_ID Revenue Quantity Location1    newrev newqty
 1: P1234    S1   2012       385X       1        1        MA  16.66667      1
 2: P1234    S1   2012       385X       2        2        NY  16.66667      1
 3: P1234    S1   2012       385X       3        3        WA  16.66667      1
 4: P1234    S2   2013       450X      34        8        NY  35.00000     10
 5: P1234    S2   2013       450X      34        8        WA  35.00000     10
 6: P1234    S2   2013       900X       6        6        NY  35.00000     20
 7: P2345    S3   2011       3700       7        7        IL 100.00000     20
 8: P2345    S4   2011       3700      88        8        IL -50.00000    -10
 9: P3456    S7   2014       A11U       9        9        MN  50.00000     20
10: P4567   S10   2015       2700     100       40        CA 100.00000     40

This is a pretty natural extension of the Q&A linked in the OP.

The by=.EACHI groups by each row of i in x[i,on=,j]; and .N is how many rows the group has.

If you want the rev and qty cols overwritten, use `:=`(Revenue = i.Revenue/.N, Quantity = i.Quantity/.N).

edited Mar 03 '17 at 20:48

answered Mar 03 '17 at 20:34

Frank

66,179
8
96
180

Fantastic. Thanks. How do I get rid of Old Revenue and Quantity? I'd appreciate if you could add that piece as well. – watchtower Mar 03 '17 at 20:36
1

@watchtower You could use their names inside the `\`:=\`()` and those columns would be overwritten. (I'm usually hesitant to overwrite like that.) – Frank Mar 03 '17 at 20:39
Thanks Frank. It's very helpful. I am reading a thread on `.EACHI`. How do I know what's being grouped on in your code above? As in, I guess it is `PO_ID`, `SO_ID`, `F_Year`, `Product_ID`? If so, how do I modify your code to group only 3 of the 4 columns, say `PO_ID`, `SO_ID`, `F_Year`? I know this is a different question, and I can create a new thread for it. Please let me know. – watchtower Mar 03 '17 at 20:42
@watchtower It is not grouping on those variables but rather on each row of the table in the `i` position. You can have rows repeating there, in principle, and each will be treated separately by .EACHI. For example, I can join on the first row twice: `DFI[DF_Lookup[c(1,1)], on=.(PO_ID, SO_ID, F_Year, Product_ID), .N, by=.EACHI]`. Regarding how to join on one set of columns but count based on a smaller set of columns, yeah, unfortunately, that feature is not available in update joins yet: https://github.com/Rdatatable/data.table/issues/733 – Frank Mar 03 '17 at 20:46
1

Well, actually, there are some pretty simple workarounds for that case (considering you're only using `.N`), but yeah, I think it needs a separate question. – Frank Mar 03 '17 at 20:47
Done. http://stackoverflow.com/questions/42588378/join-on-4-variables-then-group-on-fewer-variables-using-data-table – watchtower Mar 03 '17 at 21:22

Join then mutate using data.table without intermediate table

1 Answers1

Linked