0

I have a data frame with with multiple observations for multiple states on a yearly basis. The earliest observation for one state is in 1994 and the latest for most of them is in 2020. I have missing data in my, since most of the observations do not reach from 1994 to 2020. I would now like to expand my data frame and insert rows for the missing years. The other columns should just be filled with NAs. The data frame looks like this:

see table

My approach so far was:

relative_FTE %>% 
  group_by(canton_id) %>%
  mutate(Earliest.year = min(year)) %>%
  select(-value, -year) %>% 
  distinct() %>%
  expand(year = Earliest.year:1994, Earliest.year) %>%
  select(-Earliest.year) %>%
  left_join(relative_FTE, by = c("canton_id", "year"))

The code runs, I am, however, getting the warning messages:

1: In Earliest.year:1994 : numerical expression has 14 elements: only the first used 2: In Earliest.year:1994 : numerical expression has 16 elements: only the first used

So, the table only keeps the data from 1994 until the year of first observation for each state, but the rest won't be used any more. Could someone help me with finding a solution so that I have observations from 1994 to 2020 for each state? Help is much appreciated.

Kind regards

Frederik
  • 3
  • 1
  • 2
    Please provide a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Albin Jan 25 '22 at 09:24

1 Answers1

0

I guess you want to have rows for each pair of year (from 1994 to 2020) and canton_id. I think you can create full_df with the pairs and then merge it with you data.frame.

full_df <- list(canton_id = unique(relative_FTE$canton_id), year = 1994:2020) %>% expand.grid()
merge(relative_FTE, full_df, all = T, by = c("year","canton_id"))