Let's say I have a data_frame
that looks like this:
dput(df)
structure(list(Name = c("John Smith", "John Smith", "John Smith",
"John Smith", "John Smith"), Account_Number = c("XXXX XXXX 0000",
"XXXX XXXX 0000", "XXXX XXXX 0000", "XXXX XXXX 0000", "XXXX XXXX 0000"
), Transaction_Date = c("04/01/16", "04/02/16", "04/03/16", "04/04/16",
"04/05/16"), Amount = c(NA, 749, -256, 392, NA), Balance = c(2000,
NA, NA, NA, 1500)), .Names = c("Name", "Account_Number", "Transaction_Date",
"Amount", "Balance"), row.names = c(NA, 5L), class = c("tbl_df",
"tbl", "data.frame"))
For ease of viewing in the question, here it is printed:
# Name Account_Number Transaction_Date Amount Balance
# (chr) (chr) (chr) (dbl) (dbl)
#1 John Smith XXXX XXXX 0000 04/01/16 NA 2000
#2 John Smith XXXX XXXX 0000 04/02/16 749 NA
#3 John Smith XXXX XXXX 0000 04/03/16 -256 NA
#4 John Smith XXXX XXXX 0000 04/04/16 392 NA
#5 John Smith XXXX XXXX 0000 04/05/16 NA 1500
What I would like to do is fill in the NA
values in Balance
with the sum of Balance[i-1] + Amount[i]
. I thought I could do this easily with dplyr
using the following:
library(lubridate)
library(dplyr)
df %>%
arrange(mdy(Transaction_Date)) %>%
mutate(Balance = ifelse(is.na(Balance), as.numeric(lag(Balance)) + as.numeric(Amount), Balance))
Unfortunately, that gives me the following:
# Name Account_Number Transaction_Date Amount Balance
# (chr) (chr) (chr) (dbl) (dbl)
#1 John Smith XXXX XXXX 0000 04/01/16 NA 2000
#2 John Smith XXXX XXXX 0000 04/02/16 749 2749
#3 John Smith XXXX XXXX 0000 04/03/16 -256 NA
#4 John Smith XXXX XXXX 0000 04/04/16 392 NA
#5 John Smith XXXX XXXX 0000 04/05/16 NA 1500
So, it seems that all values are being calculated simultaneously, when what I'd want is for it to go row by row.
The desired outcome would look like this:
# Name Account_Number Transaction_Date Amount Balance
# (chr) (chr) (chr) (dbl) (dbl)
#1 John Smith XXXX XXXX 0000 04/01/16 NA 2000
#2 John Smith XXXX XXXX 0000 04/02/16 749 2749
#3 John Smith XXXX XXXX 0000 04/03/16 -256 2493
#4 John Smith XXXX XXXX 0000 04/04/16 392 2885
#5 John Smith XXXX XXXX 0000 04/05/16 NA 1500
I believe I could use apply
, but I'd prefer to keep this in the dplyr
pipeline if possible. Thanks in advance for any tips.
Update:
Based on this question, it looks like I could use RcppRoll::roll_sum
, but that function looks like it only takes one variable whereas I'd need to use two. So I'd also accept an answer that demonstrates how to use that function.