1

I have a dataframe of about 8,000 country-year observations. I want to model the correlates of an event that occurs in certain country-years. To do so properly, I need to drop observations after the event starts until it ends. The events can vary in length from less than one year to 30 years. In my df, I have a column that identifies the threshold_year and termination_year for each event. This column obviously contains many NAs for those countries and years that do not experience the event.

How do I drop observations that fall between the threshold and termination years for specific countries? I have tried to execute the following but it yields an empty dataset: filtering observations from time series conditionally by group.

See code I have attempted below. (BTW, this is my first question on SO).

df <- structure(list(country_id = c(475, 150, 475, 475, 475, 475, 475, 150, 475, 475, 475), year = c(1962, 1967, 1964, 1965, 1966, 1967, 1968, 1968, 1970, 1971, 1972), event = c(0L, 0L, 0L, 0L, 1L, 3L, 0L, 0L, 0L, 0L, 0L), threshold_year = c(NA, NA, NA, NA, 1966, 1967, NA, NA, NA, NA, NA), termination_year = c(NA, NA, NA, NA, 1966, 1970, NA, NA, NA, NA, NA)), .Names = c("country_id", "year", "event", "threshold_year", "termination_year"), row.names = 90:100, class = "data.frame")

df2 <- df %>%
    group_by(country_id) %>%
    filter(year<=threshold_year & year>termination_year)

I expect a smaller df, perhaps with about 7,000 observations. My attempts typically produce 0 observations.

EDIT

I discovered an inelegant and clumsy process for resolving this issue. I joined my complete dataframe with my threshold dataframe by country only, not year. This adds a column with threshold and termination years for every country that has an event. It also creates a lot of duplicates, but that doesn't matter. Since I no longer have NAs in my threshold and termination columns, I can easily code a dummy variable for each observation to determine whether it falls within the threshold and termination years. I can also concatenate country ids and country years. Once I subset subset my lengthy dataframe by whether the dummy = 1, I can then easily create a list of all country-years that need to be dropped. I then go back to my original data and threshold data set, left_join by both country and year, then subset this data by !(df$country-year %in% drops).

df_drops <- left_join(df, threshold_df, by=c("id"="id"))
df_drops$drops <- ifelse(df_drops$year>df_drops$threshold_year & df_drops$year<=df_drops$termination_year, 1,0)
df_drops$obs_to_drop <- ifelse(df_drops$drops==1, paste(df_drops$id,df_drops$year, sep="_"), NA) 
drops <- unique(df_drops$obs_to_drop)

df2 <- left_join(df, threshold.df,by=c("id"="id","year"="threshold_year"))
df2$id_year <- paste(df2$id,df2$year,sep="_")
df3 <- subset(df2, !(df2$id_year %in% drops))
DOR
  • 9
  • 4
  • Please share a sample of your data. `dput()` is very nice for making a copy/pasteable version of your data, e.g., `dput(df[1:10, ])` for the first 10 rows. You probably just need to add an or condition into your `filter()`, something like `| is.na(threshold_year)`, but it's hard to know without seeing data. – Gregor Thomas Jul 16 '19 at 18:15
  • Thanks. I just tried to paste in a relevant sample. – DOR Jul 16 '19 at 18:28

2 Answers2

0

I am assuming that you have a list of thresholds that are specific to each group. If so, you can put the thresholds in a new data frame, then merge them with your original country-year data frame, and finally filter. My toy example below assumes that the end date is 2 years after the start date.

df <- data.frame(country=rep(letters[1:20],each=50),
                 year=sample(1999:2018,50,T))

threshold <- data.frame(country=letters[1:20],
                        start=as.numeric(sample(1999:2016,20,T))) %>% 
  mutate(end=start + 2)

df %>% left_join(.,threshold) %>% 
  filter(year>=start & year<=end)

   country year start  end
1        a 2016  2014 2016
2        a 2016  2014 2016
3        a 2014  2014 2016
4        a 2015  2014 2016
5        a 2015  2014 2016
6        a 2015  2014 2016
7        a 2014  2014 2016
8        b 2006  2004 2006
9        b 2004  2004 2006
10       b 2004  2004 2006
11       b 2005  2004 2006
12       b 2006  2004 2006
13       b 2004  2004 2006
14       b 2004  2004 2006
15       b 2006  2004 2006
16       b 2006  2004 2006
17       b 2006  2004 2006
18       b 2006  2004 2006
19       c 2010  2008 2010
20       c 2009  2008 2010
...
kstew
  • 1,104
  • 6
  • 21
0

I think my guess was right, simply adding | is.na(threshold_year) is enough, at least for the sample data provided.

df %>% group_by(country_id) %>%
  filter((year <= threshold_year & year > termination_year) | is.na(threshold_year))
# # A tibble: 9 x 5
# # Groups:   country_id [2]
#   country_id  year event threshold_year termination_year
#        <dbl> <dbl> <int>          <dbl>            <dbl>
# 1        475  1962     0             NA               NA
# 2        150  1967     0             NA               NA
# 3        475  1964     0             NA               NA
# 4        475  1965     0             NA               NA
# 5        475  1968     0             NA               NA
# 6        150  1968     0             NA               NA
# 7        475  1970     0             NA               NA
# 8        475  1971     0             NA               NA
# 9        475  1972     0             NA               NA
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • Yes, this does the trick and feel a bit silly haven't overlooked this. Thank you! – DOR Jul 16 '19 at 19:10
  • Actually, I'm not sure this does resolve the issue. Reviewing df2, it looks like only observations with NA have been filtered, and so the first two steps (year <= threshold_year & year > termination_year) are ignored. – DOR Jul 16 '19 at 20:05
  • Summing the duration of all events longer than 1 year, it looks like my preference is to filter out about 970 country-year observations. – DOR Jul 16 '19 at 20:06