0

I have a dataset that look something like this:

Person   date     Amount
A       2019-01     900
A       2019-03     600
A       2019-04     300
A       2019-05       0
B       2019-04    1200
B       2019-07     800
B       2019-08     400
B       2019-09       0

As you'll notice in the "date" column, there are missing dates, such as '2019-02' for person A and '2019-05' and '2019-06' for person B. I would like to insert rows with the missing date and amount equal to the one before it (see expected result below).

I have tried performing group by but I don't know how to proceed from there. I've also tried converting the 'date' and 'amount' columns as lists, and from there fill in the gaps before putting them back to the dataframe. I was wondering if there is a more convenient way of doing this. In particular, getting the same results without having to extract lists from the original dataframe.

Ideally, I would want to having a dataframe that look something like this:

Person   date     Amount
A       2019-01     900
A       2019-02     900
A       2019-03     600
A       2019-04     300
A       2019-05       0
B       2019-04    1200
B       2019-05    1200
B       2019-06    1200
B       2019-07     800
B       2019-08     400
B       2019-09       0

I hope I was able to make my problem clear. Thanks in advance.

1 Answers1

2

We can first convert the date to actual date object (date1) by pasting "-01" at the end, then using complete we create a sequence of 1 month date objects for each Person. We then use fill to get Amount equal to the one before it and to get data in the original form we remove "-01" again from date1.

library(dplyr)
library(tidyr)

df %>%
  mutate(date1 = as.Date(paste0(date, "-01"))) %>%
  group_by(Person) %>%
  complete(date1 = seq(min(date1), max(date1), by = "1 month")) %>%
  fill(Amount) %>%
  mutate(date = sub("-01$", "", date1)) %>%
  select(-date1)

#  Person date    Amount
#  <fct>  <chr>    <int>
# 1 A      2019-01    900
# 2 A      2019-02    900
# 3 A      2019-03    600
# 4 A      2019-04    300
# 5 A      2019-05      0
# 6 B      2019-04   1200
# 7 B      2019-05   1200
# 8 B      2019-06   1200
# 9 B      2019-07    800
#10 B      2019-08    400
#11 B      2019-09      0

data

df <- structure(list(Person = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L), .Label = c("A", "B"), class = "factor"), date = structure(c(1L, 
2L, 3L, 4L, 3L, 5L, 6L, 7L), .Label = c("2019-01", "2019-03", 
"2019-04", "2019-05", "2019-07", "2019-08", "2019-09"), class = "factor"), 
Amount = c(900L, 600L, 300L, 0L, 1200L, 800L, 400L, 0L)), 
class = "data.frame", row.names = c(NA, -8L))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213