
Consider a fragmented dataset like this:

   ID       Date Value
1   1 2012-01-01  5065
4   1 2012-01-04  1508
5   1 2012-01-05  9489
6   1 2012-01-06  7613
7   2 2012-01-07  6896
8   2 2012-01-08  2643
11  3 2012-01-02  7294
12  3 2012-01-03  8726
13  3 2012-01-04  6262
14  3 2012-01-05  2999
15  3 2012-01-06 10000
16  3 2012-01-07  1405
18  3 2012-01-09  8372

Notice that observations are missing for rows (2, 3, 9, 10, 17). What I would like is to fill some of these gaps in the dataset with `Value = 0`, like so:

   ID       Date Value
1   1 2012-01-01  5065
2   1 2012-01-02     0
3   1 2012-01-03     0
4   1 2012-01-04  1508
5   1 2012-01-05  9489
6   1 2012-01-06  7613
7   2 2012-01-07  6896
8   2 2012-01-08  2643
9   2 2012-01-09     0
11  3 2012-01-02  7294
12  3 2012-01-03  8726
13  3 2012-01-04  6262
14  3 2012-01-05  2999
15  3 2012-01-06 10000
16  3 2012-01-07  1405
17  3 2012-01-08     0
18  3 2012-01-09  8372

The point is that the zeros only should be inserted if there is a past observation for the same (grouped) ID. I would like to avoid any loops, as the full dataset is quite large.

Any suggestions? To reproduce the dataframe:

df <- data.frame(matrix(0, nrow = 18, ncol = 3,
                        dimnames = list(NULL, c("ID", "Date", "Value"))))
df[, 1] <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3)
df[, 2] <- seq(as.Date("2012-01-01"),
               as.Date("2012-01-09"),
               by = 1)          # 9 dates, recycled over the 18 rows
df[, 3] <- sample(1000:10000, 18, replace = TRUE)
df <- df[-c(2, 3, 9, 10, 17), ]
Lucas E
  • Do you want to do this by groups of `ID`? – Rui Barradas Dec 11 '18 at 17:57
  • I suppose that would make the most sense, yes. – Lucas E Dec 11 '18 at 17:59
  • This question will help :https://stackoverflow.com/questions/53674579/how-to-check-if-an-id-comes-into-data-on-a-particular-date-that-it-stays-until-a/53674919#53674919 – Mike Dec 11 '18 at 18:00
  • See also [Fastest way to add rows for missing values in a data.frame?](https://stackoverflow.com/questions/10438969/fastest-way-to-add-rows-for-missing-values-in-a-data-frame) – Henrik Dec 11 '18 at 18:49

3 Answers


The tidyverse has `tidyr::complete()`, which is an easy way to expand a dataset like this. We can also use its `fill` argument to replace the resulting `NA`s with zero in the same step.

library(tidyverse)

df %>% group_by(ID) %>% 
  complete(Date = seq(min(Date), max(Date), "day"), fill = list(Value = 0)) 

# A tibble: 16 x 3
# Groups:   ID [3]
      ID Date       Value
   <dbl> <date>     <dbl>
 1     1 2012-01-01  1047
 2     1 2012-01-02     0
 3     1 2012-01-03     0
 4     1 2012-01-04  8147
 5     1 2012-01-05  1359
 6     1 2012-01-06  1892
 7     2 2012-01-07  3362
 8     2 2012-01-08  8988
 9     3 2012-01-02  2731
10     3 2012-01-03  9794

...
Mako212

There are already some solid answers here, but I would recommend checking out the package `padr`.

library(dplyr)
library(padr)

df %>% 
  pad(start_val = as.Date("2012-01-01"),
      end_val =   as.Date("2012-01-09"),
      group = "ID") %>% 
  fill_by_value(Value)

The package also provides some pretty intuitive functions for summarizing `Date` columns.
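For instance, `padr::thicken()` adds a higher-interval date column that you can then group and summarise on. A rough sketch (the `"week"` interval, and the derived column name `Date_week` from padr's default naming, are just for illustration):

```r
library(dplyr)
library(padr)

# Add a week-level column derived from Date (padr names it Date_week by default),
# then aggregate Value per ID and week
df %>%
  thicken("week") %>%
  group_by(ID, Date_week) %>%
  summarise(Value = sum(Value))
```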

Dave Gruenewald
  • Interesting, is there a simple way to make the completed range dynamic based on the group? For example., the max value in `ID == 1` is `"2012-01-07"`, whereas for `ID == 3` the max value is `"2012-01-10"`. In this solution we uniformly expand each group, but can we dynamically expand each group based on the group `min/max` values? – Mako212 Dec 11 '18 at 18:16
  • Nice to know about package `padr`. – Rui Barradas Dec 11 '18 at 18:18
  • If you don't specify the `start_val` and the `end_val`, the `pad` function should automatically adjust to find the `min/max` date for each group! Also, `pad` will respect `dplyr::group_by` if you prefer - rather than specifying as an argument within `pad` – Dave Gruenewald Dec 11 '18 at 18:19
  • @DaveGruenewald Very nice, that's pretty slick – Mako212 Dec 11 '18 at 18:23
  • I like this, very readable and concise, glad to know about the package! `df %>% group_by(ID) %>% pad() %>% replace_na(list(Value = 0))` – Mako212 Dec 11 '18 at 18:29
  • @Mako212 glad to help out! – Dave Gruenewald Dec 11 '18 at 19:05
  • Indeed, this provides a very elegant solution to this problem, and many others, thank you! – Lucas E Dec 11 '18 at 19:54
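Putting the comments above together, a minimal group-wise sketch (assuming, per the comment above, that `pad()` respects `dplyr::group_by` and infers each ID's own min/max date when `start_val`/`end_val` are omitted):

```r
library(dplyr)
library(padr)

# pad() infers each group's own date range;
# fill_by_value() turns the padded NAs into 0
df %>%
  group_by(ID) %>%
  pad() %>%
  fill_by_value(Value)
```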

The following is a base R solution. It uses `split` to divide the input into per-`ID` sub-dataframes and then `lapply` to process each of them.

result <- lapply(split(df, df$ID), function(DF){
  # Full daily sequence spanning this ID's observed range
  Date <- seq(min(DF$Date), max(DF$Date), by = "days")
  DF2 <- data.frame(ID = rep(DF$ID[1], length.out = length(Date)))
  DF2$Date <- Date
  DF2$Value <- 0                             # default: fill with zero
  DF2$Value[Date %in% DF$Date] <- DF$Value   # overwrite observed dates
  DF2
})

result <- do.call(rbind, result)
row.names(result) <- NULL
result
Rui Barradas