1

I have an R data.table. I'd like to copy or carry-forward the value from one column and row to a different column and row based on a key. I've done this with merge and rbindlist, but I'm wondering if there is a more parsimonious solution.

Here's a toy example: Let's say I'm tracking opening and closing balances for different checking accounts on different dates. I want to carry-forward the previous day's (t-1) closing balance to current day's (t) opening balance.

Have this input:

> DT_in <- data.table(date = c("2017-12-29", "2017-12-29", "2017-12-29", "2018-01-02", 
+                         "2018-01-02", "2018-01-02", "2018-01-02", "2018-01-03", 
+                         "2018-01-03", "2018-01-03", "2018-01-03"), 
+                  account_id = c("A17", "A23", "B21", "A17", "A23", "B21", "C12", 
+                                 "A17", "A23", "B21", "C12"),
+                  opening_balance = NA,
+                  ending_balance = c(224, 254, 240, 290, 107, 272, 105, 256, 215, 
+                                     202, 238)
+                  )
> DT_in
          date account_id opening_balance ending_balance
 1: 2017-12-29        A17              NA            224
 2: 2017-12-29        A23              NA            254
 3: 2017-12-29        B21              NA            240
 4: 2018-01-02        A17              NA            290
 5: 2018-01-02        A23              NA            107
 6: 2018-01-02        B21              NA            272
 7: 2018-01-02        C12              NA            105
 8: 2018-01-03        A17              NA            256
 9: 2018-01-03        A23              NA            215
10: 2018-01-03        B21              NA            202
11: 2018-01-03        C12              NA            238

Want this output:

> DT_out <- data.table(date = c("2017-12-29", "2017-12-29", "2017-12-29", "2018-01-02", 
+                           "2018-01-02", "2018-01-02", "2018-01-02", "2018-01-03", 
+                           "2018-01-03", "2018-01-03", "2018-01-03"), 
+                  account_id = c("A17", "A23", "B21", "A17", "A23", "B21", "C12", 
+                                 "A17", "A23", "B21", "C12"),
+                  opening_balance = c(NA, NA, NA, 224, 254, 240, NA, 290, 107, 272, 105), 
+                  ending_balance = c(224, 254, 240, 290, 107, 272, 105, 256, 215, 
+                                     202, 238)
+                  )
> DT_out
          date account_id opening_balance ending_balance
 1: 2017-12-29        A17              NA            224
 2: 2017-12-29        A23              NA            254
 3: 2017-12-29        B21              NA            240
 4: 2018-01-02        A17             224            290
 5: 2018-01-02        A23             254            107
 6: 2018-01-02        B21             240            272
 7: 2018-01-02        C12              NA            105
 8: 2018-01-03        A17             290            256
 9: 2018-01-03        A23             107            215
10: 2018-01-03        B21             272            202
11: 2018-01-03        C12             105            238

Note that account_ids are not necessarily persistent from one date to the next (new ones can added or old ones removed). While it should not matter, please also note that the dates are business dates and not calendar dates.

Ursus Frost
  • 405
  • 1
  • 7
  • 14
  • 1
    You should include R code to create your example (guidance here: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250). Btw, you probably need tag data.table or DT rather than datatable. – Frank Oct 11 '18 at 19:31
  • 1
    @Frank Thanks. I've updated with code to create example and changed tag. – Ursus Frost Oct 11 '18 at 19:44
  • I've tried the `shift` function in `data.table` but can't get it to work. – Ursus Frost Oct 11 '18 at 19:46
  • @UrsusFrost Why does `shift` not work? – R Yoda Oct 11 '18 at 20:13
  • 1
    @RYoda I'm think `shift` will work. But, I'm just not doing it correctly. I've tried a few different permutations based on the `help` documentation associated with `shift`. For example, I've tried `DT_in[, opening_balance := shift(.SD, 1L), by = account_id]` but it results in the `date` column values being shifted into the `opening_balance` column. I've tried setting the `key` and other changes, but haven't figured it out yet. – Ursus Frost Oct 11 '18 at 20:26
  • 1
    `setkey(DT_in, account_id, date)` and `setorder(DT_in, account_id, date)` should work either – R Yoda Oct 11 '18 at 20:34
  • @RYoda Yes, you are correct. I was not using both the `account_id` and `date` to set the `key`. – Ursus Frost Oct 11 '18 at 20:38

1 Answers1

3

If you fix your example data to use the same data type of "opening_balance" and "ending_balance" columns via

...
opening_balance = NA_real_,
...

you can use this code to carry the last observation forward per group (after applying the right order/sorting of the rows!):

setorder(DT_in, account_id, date)
DT_in[, opening_balance := shift(ending_balance, 1), by = .(account_id)]

This results in (sorted output):

> DT_in
          date account_id opening_balance ending_balance
 1: 2017-12-29        A17              NA            224
 2: 2018-01-02        A17             224            290
 3: 2018-01-03        A17             290            256
 4: 2017-12-29        A23              NA            254
 5: 2018-01-02        A23             254            107
 6: 2018-01-03        A23             107            215
 7: 2017-12-29        B21              NA            240
 8: 2018-01-02        B21             240            272
 9: 2018-01-03        B21             272            202
10: 2018-01-02        C12              NA            105
11: 2018-01-03        C12             105            238
R Yoda
  • 8,358
  • 2
  • 50
  • 87
  • Brilliant! Many thanks! Out of curiosity, why do you wrap the `account_id` in the `by =` clause with `.()`? It seems to work without the parenthetical wrapper, but I notice that `RStudio` finds and can auto-complete `account_id` when you wrap it in `.()` versus when you don't, so there's some nuance I'm missing. I haven't yet grasped the intuition for when to do `.()` or use `get()` or `eval()` when using `data.table`. – Ursus Frost Oct 11 '18 at 20:37
  • 1
    the `.()` syntax is just `data.table`s abreviation for `list()` and not required if you specify only one column. It is just my habit to always use it... `get` or `mget` is required if you want to read the name of one or multiple columns out of a variable (normally in the second "j" argument of a `data.table`). `eval` "executes" an R expression parsed via `quote` or `substitute` (rarely required and difficult to explain...) – R Yoda Oct 11 '18 at 21:36