Complete a data.frame with new values by group

Question

The tidyr package has the very useful complete function, which allows us to make "missing" rows explicit: you pass all the relevant columns to the function and it makes sure all possible combinations exist, creating new rows if needed with the missing combinations. You can also give it the expected values of a given column (i.e. calendar_quarter = 1:4).

There's also nesting(), which can be used within complete() to group columns so that only the existing combinations of those columns are used.

However, what about an "intermediate" case? That is, one where we want to give different expected values for each group? This is best shown with an example:

suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
})

df <- tribble(
  ~id, ~year, ~semester,
  1, 2000, 1,
  1, 2001, 1,
  1, 2001, 2,
  
  2, 1999, 1,
  2, 1999, 2,
  2, 2001, 1
)

Here we have a table with years and semesters where a given id was relevant. However, we are missing some semesters (2000S1 for id=1 and 2000S1, 2000S2 and 2001S2 for id=2).

First we try using nesting(), since we'll otherwise create rows for id=1, year=1999, which we don't need:

complete(df, nesting(id, year), semester = 1:2)
#> # A tibble: 8 x 3
#>      id  year semester
#>   <dbl> <dbl>    <dbl>
#> 1     1  2000        1
#> 2     1  2000        2  # new
#> 3     1  2001        1
#> 4     1  2001        2
#> 5     2  1999        1
#> 6     2  1999        2
#> 7     2  2001        1
#> 8     2  2001        2  # new

This is ok, filling out some missing semesters for both ids. However, it still hides the missing year 2000 for id=2. This is natural, since we didn't tell complete() to use any extra values for the year column.

So let's do that. We have to ditch nesting(), and then we can tell complete() to check for the existence of the entire range of known years:

complete(df, id, year = min(year):max(year), semester = 1:2)
#> # A tibble: 12 x 3
#>       id  year semester
#>    <dbl> <dbl>    <dbl>
#>  1     1  1999        1  # new, unwanted
#>  2     1  1999        2  # new, unwanted
#>  3     1  2000        1
#>  4     1  2000        2  # new
#>  5     1  2001        1
#>  6     1  2001        2
#>  7     2  1999        1
#>  8     2  1999        2
#>  9     2  2000        1  # new
#> 10     2  2000        2  # new
#> 11     2  2001        1
#> 12     2  2001        2  # new

^{Created on 2020-10-22 by the reprex package (v0.3.0)}

This is better, showing all the missing semesters, including the year 2000 for id=2.

However, it came at the cost of also exposing the year 1999 for id=1, which we don't want.

So, is there an elegant way of getting the following output?

#> # A tibble: 10 x 3
#>       id  year semester
#>    <dbl> <dbl>    <dbl>
#>  1     1  2000        1
#>  2     1  2000        2  # new
#>  3     1  2001        1
#>  4     1  2001        2
#>  5     2  1999        1
#>  6     2  1999        2
#>  7     2  2000        1  # new
#>  8     2  2000        2  # new
#>  9     2  2001        1
#> 10     2  2001        2  # new

The only solution I've come up with would be to build another table startEndY to store the initial and final years for each id, run complete(df, id, year = min(year):max(year), semester = 1:2) and then perform a join between df and startEndY to trim out the unwanted results, but that seems inelegant and slow, since the complete() will create many unnecessary rows (especially in my actual use-case).

score 3 · Accepted Answer · answered Oct 23 '20 at 01:57

You can complete the missing observations per id :

library(dplyr)

df %>% group_by(id) %>% tidyr::complete(year = min(year):max(year), semester)

#      id  year semester
#   <dbl> <dbl>    <dbl>
# 1     1  2000        1
# 2     1  2000        2
# 3     1  2001        1
# 4     1  2001        2
# 5     2  1999        1
# 6     2  1999        2
# 7     2  2000        1
# 8     2  2000        2
# 9     2  2001        1
#10     2  2001        2

Oh, wasn't aware that `complete` respected `group_by`! – Wasabi Oct 23 '20 at 02:00 — Wasabi, Oct 23 '20 at 02:00

Complete a data.frame with new values by group

1 Answers1

Linked

Related