4

How can I insert missing dates in a dataframe with more than 2 columns? In my data, each date has an observation between sp1 and sp2. If there is no observation between sp1 and sp2 on a day, that date is missing.

Here's what part of my df looks like:

the_date    sp1 sp2 win     loss    sp1_name    sp2_name
4/1/13      A   B   8       8         A_name    B_name
4/2/13      A   B   6       10        A_name    B_name
4/3/13      A   B   7       5         A_name    B_name
4/5/13      A   B   7       5         A_name    B_name
4/6/13      A   B   6       2         A_name    B_name
4/7/13      A   B   15      10        A_name    B_name
4/1/13      A   C   3       8         A_name    C_name
4/2/13      A   C   2       12        A_name    C_name
4/3/13      A   C   9       7         A_name    C_name
4/4/13      A   C   14      8         A_name    C_name
4/6/13      A   C   9       10        A_name    C_name
4/1/13      A   D   13      13        A_name    D_name
4/2/13      A   D   13      5         A_name    D_name
4/3/13      A   D   7       1         A_name    D_name
4/4/13      A   D   15      11        A_name    D_name
4/5/13      A   D   3       11        A_name    D_name
4/6/13      A   D   12      11        A_name    D_name
4/7/13      A   D   9       9         A_name    D_name

For instance, 4/4/13 for A-B is missing. What I want in my output is to insert those missing dates with all the corresponding columns and assign 0 to wins and losses. So my output will look like this the rows with * added):

the_date    sp1 sp2 win     loss    sp1_name    sp2_name
4/1/13      A   B   8       8         A_name    B_name
4/2/13      A   B   6       10        A_name    B_name
4/3/13      A   B   7       5         A_name    B_name
*4/4/13     A   B   0       0         A_name    B_name
4/5/13      A   B   7       5         A_name    B_name
4/6/13      A   B   6       2         A_name    B_name
4/7/13      A   B   15      10        A_name    B_name
4/1/13      A   C   3       8         A_name    C_name
4/2/13      A   C   2       12        A_name    C_name
4/3/13      A   C   9       7         A_name    C_name
4/4/13      A   C   14      8         A_name    C_name
*4/5/13     A   C   0       0         A_name    C_name
4/6/13      A   C   9       10        A_name    C_name
*4/7/13     A   C   0       0         A_name    C_name
4/1/13      A   D   13      13        A_name    D_name
4/2/13      A   D   13      5         A_name    D_name
4/3/13      A   D   7       1         A_name    D_name
4/4/13      A   D   15      11        A_name    D_name
4/5/13      A   D   3       11        A_name    D_name
4/6/13      A   D   12      11        A_name    D_name
4/7/13      A   D   9       9         A_name    D_name

I know that if we have a 2 column dataframe (values, date), we can pad the dataframe with missing dates by merging it with a full-range time. However, my dataframe has more than 2 columns.

Also, this is just part of my data, so I have other combinations for other dates:

sp1 sp2 
B    C
B    A
B    D
C    A
C    B
C    D
D    B
D    C
D    A

Any clues?

asiehh
  • 553
  • 12
  • 22
  • `rbind` and then sort data.frame by `the_date`, `sp1` and `sp2`? Is the range always 1-7? – romants Nov 16 '15 at 18:34
  • rbind which 2 dataframes? No, 1-7 is an example for this example. In reality, my data ranges over a few years. – asiehh Nov 16 '15 at 18:52
  • data.frame and missing rows – romants Nov 16 '15 at 18:53
  • I don't know what the missing rows are. My data has about 65,000 rows! – asiehh Nov 16 '15 at 18:54
  • Okay that makes sense. I guess that bring to the second part of my question - can you please explain the pattern a bit more clear? – romants Nov 16 '15 at 18:56
  • For each sp1-sp2 pairing, I want to have all the dates within the range of min(date) and max(date) in the output. The values for wins and losses for these missing dates has to be 0. So let's say if I have 4 distinct sp (A,B, C, D), I will have 4^2=16 different combinations for each date. Some of these combinations are present in my data, some are not. For those that are absent, I want to insert a row with wins=losses=0. I hope it's clear – asiehh Nov 16 '15 at 18:59

2 Answers2

4

This is a dplyr approach. You may want to consider a data.table approach given that you have a large data set.

d <- read.table(textConnection("the_date    sp1 sp2 win     loss    sp1_name    sp2_name
4/1/13      A   B   8       8         A_name    B_name
4/2/13      A   B   6       10        A_name    B_name
4/3/13      A   B   7       5         A_name    B_name
4/5/13      A   B   7       5         A_name    B_name
4/6/13      A   B   6       2         A_name    B_name
4/7/13      A   B   15      10        A_name    B_name
4/1/13      A   C   3       8         A_name    C_name
4/2/13      A   C   2       12        A_name    C_name
4/3/13      A   C   9       7         A_name    C_name
4/4/13      A   C   14      8         A_name    C_name
4/6/13      A   C   9       10        A_name    C_name
4/1/13      A   D   13      13        A_name    D_name
4/2/13      A   D   13      5         A_name    D_name
4/3/13      A   D   7       1         A_name    D_name
4/4/13      A   D   15      11        A_name    D_name
4/5/13      A   D   3       11        A_name    D_name
4/6/13      A   D   12      11        A_name    D_name
4/7/13      A   D   9       9         A_name    D_name"),
stringsAsFactors = FALSE, header = TRUE)

d$the_date <- as.Date(d$the_date, "%m/%d/%y")

Update

I realize that my original answer below is not completely correct. For example, it did not pad group A C with a 4/7/13. With that in mind, I came up with a better and I think faster approach.

#Step one combine sp1 and sp2 into one group
d$group <- paste0(d$sp1,d$sp2)

#Step two find min and max date in the database

min_d <- min(d$the_date)
max_d <- max(d$the_date)

#Step three use dplyr
d %>%
  do(expand.grid(unique(.$group), seq(min_d, max_d, 1))) %>% 
  rename(group = Var1, the_date = Var2) %>%
  left_join(d) %>%
  arrange(group) %>%
  select(-group)

Original

#Step one combine sp1 and sp2 into one group
d$group <- paste0(d$sp1,d$sp2)

#Step two use dplyr.  
d %>%
  group_by(group) %>%
  summarise(min = min(the_date), max = max(the_date)) %>%
  rowwise() %>%
  do(data.frame(group = .$group, the_date = seq(.$min, .$max, 1))) %>%
  left_join(d) %>%
  select(-group)

In general your question is similar to this one. Check it out for more information/ideas.

Community
  • 1
  • 1
Jacob H
  • 4,317
  • 2
  • 32
  • 39
  • Thanks Jacob. Could you explain what the 'do' line does? – asiehh Nov 16 '15 at 21:57
  • Read this http://www.inside-r.org/node/230616. `Do` basically allows you to call any function (i.e. not just 'mutate', 'filter', etc...) in an `dplyr` context. Specifically, in this example, I use `do` to construct a new `data.frame` that contains two variables, the group and the expanded date variable (i.e a date variable that include the missing dates). – Jacob H Nov 16 '15 at 22:16
2

Here's a solution using pad and fill_by_value from padr:

library(dplyr)
library(tidyr)
library(padr)

df %>%
  mutate(the_date = as.Date(the_date, "%m/%d/%y")) %>%
  group_by(sp1, sp2) %>%
  pad() %>%              
  fill(sp1_name:sp2_name) %>%      
  fill_by_value(win, loss)

Result:

# A tibble: 20 x 7
# Groups:   sp1, sp2 [3]
     the_date    sp1    sp2   win  loss sp1_name sp2_name
       <date> <fctr> <fctr> <dbl> <dbl>   <fctr>   <fctr>
 1 2013-04-01      A      B     8     8   A_name   B_name
 2 2013-04-02      A      B     6    10   A_name   B_name
 3 2013-04-03      A      B     7     5   A_name   B_name
 4 2013-04-04      A      B     0     0   A_name   B_name
 5 2013-04-05      A      B     7     5   A_name   B_name
 6 2013-04-06      A      B     6     2   A_name   B_name
 7 2013-04-07      A      B    15    10   A_name   B_name
 8 2013-04-01      A      C     3     8   A_name   C_name
 9 2013-04-02      A      C     2    12   A_name   C_name
10 2013-04-03      A      C     9     7   A_name   C_name
11 2013-04-04      A      C    14     8   A_name   C_name
12 2013-04-05      A      C     0     0   A_name   C_name
13 2013-04-06      A      C     9    10   A_name   C_name
14 2013-04-01      A      D    13    13   A_name   D_name
15 2013-04-02      A      D    13     5   A_name   D_name
16 2013-04-03      A      D     7     1   A_name   D_name
17 2013-04-04      A      D    15    11   A_name   D_name
18 2013-04-05      A      D     3    11   A_name   D_name
19 2013-04-06      A      D    12    11   A_name   D_name
20 2013-04-07      A      D     9     9   A_name   D_name

Data:

df = structure(list(the_date = structure(c(1L, 2L, 3L, 5L, 6L, 7L, 
1L, 2L, 3L, 4L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 7L), .Label = c("4/1/13", 
"4/2/13", "4/3/13", "4/4/13", "4/5/13", "4/6/13", "4/7/13"), class = "factor"), 
    sp1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "A", class = "factor"), 
    sp2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
    2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("B", "C", "D"
    ), class = "factor"), win = c(8L, 6L, 7L, 7L, 6L, 15L, 3L, 
    2L, 9L, 14L, 9L, 13L, 13L, 7L, 15L, 3L, 12L, 9L), loss = c(8L, 
    10L, 5L, 5L, 2L, 10L, 8L, 12L, 7L, 8L, 10L, 13L, 5L, 1L, 
    11L, 11L, 11L, 9L), sp1_name = structure(c(1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "A_name", class = "factor"), 
    sp2_name = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
    2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("B_name", 
    "C_name", "D_name"), class = "factor")), .Names = c("the_date", 
"sp1", "sp2", "win", "loss", "sp1_name", "sp2_name"), class = "data.frame", row.names = c(NA, 
-18L))
acylam
  • 18,231
  • 5
  • 36
  • 45