2

Let's say I have a data_frame that looks like this:

dput(df)
structure(list(Name = c("John Smith", "John Smith", "John Smith", 
"John Smith", "John Smith"), Account_Number = c("XXXX XXXX 0000", 
"XXXX XXXX 0000", "XXXX XXXX 0000", "XXXX XXXX 0000", "XXXX XXXX 0000"
), Transaction_Date = c("04/01/16", "04/02/16", "04/03/16", "04/04/16", 
"04/05/16"), Amount = c(NA, 749, -256, 392, NA), Balance = c(2000, 
NA, NA, NA, 1500)), .Names = c("Name", "Account_Number", "Transaction_Date", 
"Amount", "Balance"), row.names = c(NA, 5L), class = c("tbl_df", 
"tbl", "data.frame"))

For ease of viewing in the question, here it is printed:

#        Name Account_Number Transaction_Date Amount Balance
#       (chr)          (chr)            (chr)  (dbl)   (dbl)
#1 John Smith XXXX XXXX 0000         04/01/16     NA    2000
#2 John Smith XXXX XXXX 0000         04/02/16    749      NA
#3 John Smith XXXX XXXX 0000         04/03/16   -256      NA
#4 John Smith XXXX XXXX 0000         04/04/16    392      NA
#5 John Smith XXXX XXXX 0000         04/05/16     NA    1500

What I would like to do is fill in the NA values in Balance with the sum of Balance[i-1] + Amount[i]. I thought I could do this easily with dplyr using the following:

library(lubridate)
library(dplyr)
df %>%
  arrange(mdy(Transaction_Date)) %>%
  mutate(Balance = ifelse(is.na(Balance), as.numeric(lag(Balance)) + as.numeric(Amount), Balance))

Unfortunately, that gives me the following:

#        Name Account_Number Transaction_Date Amount Balance
#       (chr)          (chr)            (chr)  (dbl)   (dbl)
#1 John Smith XXXX XXXX 0000         04/01/16     NA    2000
#2 John Smith XXXX XXXX 0000         04/02/16    749    2749
#3 John Smith XXXX XXXX 0000         04/03/16   -256      NA
#4 John Smith XXXX XXXX 0000         04/04/16    392      NA
#5 John Smith XXXX XXXX 0000         04/05/16     NA    1500

So, it seems that all values are being calculated simultaneously, when what I'd want is for it to go row by row.

The desired outcome would look like this:

#        Name Account_Number Transaction_Date Amount Balance
#       (chr)          (chr)            (chr)  (dbl)   (dbl)
#1 John Smith XXXX XXXX 0000         04/01/16     NA    2000
#2 John Smith XXXX XXXX 0000         04/02/16    749    2749
#3 John Smith XXXX XXXX 0000         04/03/16   -256    2493
#4 John Smith XXXX XXXX 0000         04/04/16    392    2885
#5 John Smith XXXX XXXX 0000         04/05/16     NA    1500

I believe I could use apply, but I'd prefer to keep this in the dplyr pipeline if possible. Thanks in advance for any tips.

Update:

Based on this question, it looks like I could use RcppRoll::roll_sum, but that function looks like it only takes one variable whereas I'd need to use two. So I'd also accept an answer that demonstrates how to use that function.

Community
  • 1
  • 1
tblznbits
  • 6,602
  • 6
  • 36
  • 66

2 Answers2

5

Edit: Warning!

The original approach presented here will not handle resetting Balance properly, as you'll see if you pass it df %>% bind_rows(df). I am only leaving it here as it was the accepted answer. See below for an updated approach that avoids the problem.


Original [faulty] approach

You're really taking a cumulative sum, but using cumsum here is sort of a pain because it doesn't have an na.rm argument. However, you can remove and reinsert the NA values:

# replace NAs with 0s so cumsum will work
df %>% mutate_each(funs(ifelse(is.na(.), 0, .)), Balance, Amount) %>% 
    # replace 0 values in Balance with cumsum of Balance and Amount
    mutate(Balance = ifelse(Balance == 0, cumsum(Balance + Amount), Balance)) %>% 
    # put NAs back
    mutate(Amount = ifelse(Amount == 0, NA, Amount))

# Source: local data frame [5 x 5]
# 
#         Name Account_Number Transaction_Date Amount Balance
#        (chr)          (chr)            (chr)  (dbl)   (dbl)
# 1 John Smith XXXX XXXX 0000         04/01/16     NA    2000
# 2 John Smith XXXX XXXX 0000         04/02/16    749    2749
# 3 John Smith XXXX XXXX 0000         04/03/16   -256    2493
# 4 John Smith XXXX XXXX 0000         04/04/16    392    2885
# 5 John Smith XXXX XXXX 0000         04/05/16     NA    1500

Note that if you have actual 0 values in Balance or Amount (or if that's possible), you may need to make the approach more robust.


New [functioning] approach

By grouping by the run length of when Amount is NA, we can make sure we're adding the correct cumulative sum, and not adding Amount values before a reset of Balance:

# pass it a bigger df to test
df %>% bind_rows(df) %>% 
    # replace NAs with last value
    tidyr::fill(Balance) %>% 
    # group so cumsums are not added after Balance reset
    group_by(NA_Amount = is.na(Amount), 
             rle_Amount = data.table::rleid(NA_Amount)) %>%
    mutate(Balance = ifelse(NA_Amount, Balance, Balance + cumsum(Amount))) %>%
    # clean up columns
    ungroup() %>% select(-NA_Amount, -rle_Amount)

# Source: local data frame [10 x 5]
# 
#          Name Account_Number Transaction_Date Amount Balance
#         (chr)          (chr)            (chr)  (dbl)   (dbl)
# 1  John Smith XXXX XXXX 0000         04/01/16     NA    2000
# 2  John Smith XXXX XXXX 0000         04/02/16    749    2749
# 3  John Smith XXXX XXXX 0000         04/03/16   -256    2493
# 4  John Smith XXXX XXXX 0000         04/04/16    392    2885
# 5  John Smith XXXX XXXX 0000         04/05/16     NA    1500
# 6  John Smith XXXX XXXX 0000         04/01/16     NA    2000
# 7  John Smith XXXX XXXX 0000         04/02/16    749    2749
# 8  John Smith XXXX XXXX 0000         04/03/16   -256    2493
# 9  John Smith XXXX XXXX 0000         04/04/16    392    2885
# 10 John Smith XXXX XXXX 0000         04/05/16     NA    1500
alistaire
  • 42,459
  • 4
  • 77
  • 117
3
library(data.table)

setDT(df)[, Balance := c(Balance[1], Balance[1] + cumsum(Amount[-1]))
          , by = cumsum(!is.na(Balance))][]
#         Name Account_Number Transaction_Date Amount Balance
#1: John Smith XXXX XXXX 0000         04/01/16     NA    2000
#2: John Smith XXXX XXXX 0000         04/02/16    749    2749
#3: John Smith XXXX XXXX 0000         04/03/16   -256    2493
#4: John Smith XXXX XXXX 0000         04/04/16    392    2885
#5: John Smith XXXX XXXX 0000         04/05/16     NA    1500
eddi
  • 49,088
  • 6
  • 104
  • 155
  • 5
    I'm beginning to think that I really need to start learning `data.table`. – tblznbits May 17 '16 at 19:49
  • 1
    @brittenb you can also translate the above to `dplyr` in a straightforward fashion, though you'll get an extra column from the grouping that you'll need to remove later + will copy the *entire* `data.frame` around a few times instead of the above in-place modification – eddi May 17 '16 at 19:54