The tidyr
package has the very useful complete
function, which allows us to make "missing" rows explicit: you pass all the relevant columns to the function and it makes sure all possible combinations exist, creating new rows if needed with the missing combinations. You can also give it the expected values of a given column (i.e. calendar_quarter = 1:4
).
There's also nesting()
, which can be used within complete()
to group columns so that only the existing combinations of those columns are used.
However, what about an "intermediate" case? That is, one where we want to give different expected values for each group? This is best shown with an example:
suppressPackageStartupMessages({
library(dplyr)
library(tidyr)
})
df <- tribble(
~id, ~year, ~semester,
1, 2000, 1,
1, 2001, 1,
1, 2001, 2,
2, 1999, 1,
2, 1999, 2,
2, 2001, 1
)
Here we have a table with years and semesters where a given id
was relevant. However, we are missing some semesters (2000S1 for id=1
and 2000S1, 2000S2 and 2001S2 for id=2
).
First we try using nesting()
, since we'll otherwise create rows for id=1, year=1999
, which we don't need:
complete(df, nesting(id, year), semester = 1:2)
#> # A tibble: 8 x 3
#> id year semester
#> <dbl> <dbl> <dbl>
#> 1 1 2000 1
#> 2 1 2000 2 # new
#> 3 1 2001 1
#> 4 1 2001 2
#> 5 2 1999 1
#> 6 2 1999 2
#> 7 2 2001 1
#> 8 2 2001 2 # new
This is ok, filling out some missing semesters for both id
s. However, it still hides the missing year 2000 for id=2
. This is natural, since we didn't tell complete()
to use any extra values for the year
column.
So let's do that. We have to ditch nesting()
, and then we can tell complete()
to check for the existence of the entire range of known years:
complete(df, id, year = min(year):max(year), semester = 1:2)
#> # A tibble: 12 x 3
#> id year semester
#> <dbl> <dbl> <dbl>
#> 1 1 1999 1 # new, unwanted
#> 2 1 1999 2 # new, unwanted
#> 3 1 2000 1
#> 4 1 2000 2 # new
#> 5 1 2001 1
#> 6 1 2001 2
#> 7 2 1999 1
#> 8 2 1999 2
#> 9 2 2000 1 # new
#> 10 2 2000 2 # new
#> 11 2 2001 1
#> 12 2 2001 2 # new
Created on 2020-10-22 by the reprex package (v0.3.0)
This is better, showing all the missing semesters, including the year 2000 for id=2
.
However, it came at the cost of also exposing the year 1999 for id=1
, which we don't want.
So, is there an elegant way of getting the following output?
#> # A tibble: 10 x 3
#> id year semester
#> <dbl> <dbl> <dbl>
#> 1 1 2000 1
#> 2 1 2000 2 # new
#> 3 1 2001 1
#> 4 1 2001 2
#> 5 2 1999 1
#> 6 2 1999 2
#> 7 2 2000 1 # new
#> 8 2 2000 2 # new
#> 9 2 2001 1
#> 10 2 2001 2 # new
The only solution I've come up with would be to build another table startEndY
to store the initial and final years for each id
, run complete(df, id, year = min(year):max(year), semester = 1:2)
and then perform a join between df
and startEndY
to trim out the unwanted results, but that seems inelegant and slow, since the complete()
will create many unnecessary rows (especially in my actual use-case).