I have a dataframe of about 8,000 country-year observations. I want to model the correlates of an event that occurs in certain country-years. To do so properly, I need to drop observations after the event starts until it ends. The events can vary in length from less than one year to 30 years. In my df, I have a column that identifies the threshold_year and termination_year for each event. This column obviously contains many NAs for those countries and years that do not experience the event.
How do I drop observations that fall between the threshold and termination years for specific countries? I have tried to execute the following but it yields an empty dataset: filtering observations from time series conditionally by group.
See code I have attempted below. (BTW, this is my first question on SO).
df <- structure(list(country_id = c(475, 150, 475, 475, 475, 475, 475, 150, 475, 475, 475), year = c(1962, 1967, 1964, 1965, 1966, 1967, 1968, 1968, 1970, 1971, 1972), event = c(0L, 0L, 0L, 0L, 1L, 3L, 0L, 0L, 0L, 0L, 0L), threshold_year = c(NA, NA, NA, NA, 1966, 1967, NA, NA, NA, NA, NA), termination_year = c(NA, NA, NA, NA, 1966, 1970, NA, NA, NA, NA, NA)), .Names = c("country_id", "year", "event", "threshold_year", "termination_year"), row.names = 90:100, class = "data.frame")
df2 <- df %>%
group_by(country_id) %>%
filter(year<=threshold_year & year>termination_year)
I expect a smaller df, perhaps with about 7,000 observations. My attempts typically produce 0 observations.
EDIT
I discovered an inelegant and clumsy process for resolving this issue. I joined my complete dataframe with my threshold dataframe by country only, not year. This adds a column with threshold and termination years for every country that has an event. It also creates a lot of duplicates, but that doesn't matter. Since I no longer have NAs in my threshold and termination columns, I can easily code a dummy variable for each observation to determine whether it falls within the threshold and termination years. I can also concatenate country ids and country years. Once I subset subset my lengthy dataframe by whether the dummy = 1, I can then easily create a list of all country-years that need to be dropped. I then go back to my original data and threshold data set, left_join by both country and year, then subset this data by !(df$country-year %in% drops).
df_drops <- left_join(df, threshold_df, by=c("id"="id"))
df_drops$drops <- ifelse(df_drops$year>df_drops$threshold_year & df_drops$year<=df_drops$termination_year, 1,0)
df_drops$obs_to_drop <- ifelse(df_drops$drops==1, paste(df_drops$id,df_drops$year, sep="_"), NA)
drops <- unique(df_drops$obs_to_drop)
df2 <- left_join(df, threshold.df,by=c("id"="id","year"="threshold_year"))
df2$id_year <- paste(df2$id,df2$year,sep="_")
df3 <- subset(df2, !(df2$id_year %in% drops))