0

How do I create counts of how many observations were active between start and end years grouped by an id variable?

I have a df whose unit of analysis is organizations. These organizations are active for only certain periods of time (startyear, endyear) and in certain countries (acr). Some countries have more than one organization active at the same time. I would like to create a variable that reflects the number of contemporaneously active organizations for each country (for each observation the answer will be at least 1 and could be several dozen). For each observation's startyear, I believe I would need to go over my data frame by country and count how many times the startyear falls between other observations startyear/endyear. How to do it?

I have attempted to do this with dplyr (per Count how many fall in each group in R):

nsa2 <- nsa %>%
  group_by(acr,startyear) %>%
  mutate(count=n())

Is there a way to use filter() or between(), even though the between years are not fixed but change? or even apply to go back over the df?

Here is some sample data with a few obs that overlap:

structure(list(acr = structure(c(8L, 10L, 11L, 7L, 6L, 4L, 7L, 
6L, 3L, 3L, 12L, 2L, 13L, 9L, 1L, 8L, 15L, 7L, 5L, 14L), .Label = c("AFG", 
"ARG", "CHA", "DRC", "IRQ", "ISR", "MYA", "RUS", "RWA", "SOM", 
"SUD", "THI", "UKG", "YUG", "ZIM"), class = "factor"), startyear = c(1994, 
2009, 1971, 1996, 1965, 2008, 2000, 1990, 1992, 1992, 2003, 1974, 
1957, 1997, 1996, 1946, 1976, 1996, 2004, 1998), endyear = c(1996, 
2010, 1971, 1996, 1973, 2008, 2000, 1999, 1994, 1994, 2011, 1977, 
1957, 2002, 1996, 1948, 1979, 1996, 2005, 1999), obsid = structure(c(5L, 
3L, 19L, 6L, 11L, 10L, 20L, 9L, 17L, 18L, 8L, 12L, 14L, 4L, 2L, 
16L, 1L, 13L, 15L, 7L), .Label = c("NSA.3.4-1030", "NSA.3.4-1126", 
"NSA.3.4-1219", "NSA.3.4-1381", "NSA.3.4-1546", "NSA.3.4-157", 
"NSA.3.4-1594", "NSA.3.4-1657", "NSA.3.4-1666", "NSA.3.4-1678", 
"NSA.3.4-271", "NSA.3.4-346", "NSA.3.4-376", "NSA.3.4-391", "NSA.3.4-406", 
"NSA.3.4-58", "NSA.3.4-679", "NSA.3.4-685", "NSA.3.4-892", "NSA.3.4-91"
), class = "factor")), row.names = c(NA, -20L), groups = structure(list(
    acr = structure(c(1L, 4L, 15L, 26L, 43L, 44L, 44L, 59L, 59L, 
    75L, 75L, 77L, 83L, 87L, 91L, 97L, 104L, 105L), .Label = c("AFG", 
    "ALG", "ANG", "ARG", "AZE", "BFO", "BNG", "BOL", "BOS", "BUI", 
    "CAM", "CAO", "CDI", "CEN", "CHA", "CHL", "CHN", "COL", "COM", 
    "CON", "COS", "CRO", "CUB", "DJI", "DOM", "DRC", "EGY", "ERI", 
    "ETH", "FRN", "GAB", "GAM", "GHA", "GNB", "GRC", "GRG", "GUA", 
    "GUI", "HAI", "IND", "INS", "IRN", "IRQ", "ISR", "KEN", "LAO", 
    "LBR", "LEB", "LES", "LIB", "MAA", "MAC", "MAG", "MAL", "MEX", 
    "MLD", "MLI", "MOR", "MYA", "MZM", "NEP", "NIC", "NIG", "NIR", 
    "NTH", "OMA", "PAK", "PAN", "PAR", "PER", "PHI", "PNG", "POR", 
    "RUM", "RUS", "RVN", "RWA", "SAF", "SAL", "SAU", "SEN", "SIE", 
    "SOM", "SPN", "SRI", "SSD", "SUD", "SUR", "SYR", "TAJ", "THI", 
    "TOG", "TRI", "TUN", "TUR", "UGA", "UKG", "URU", "USA", "UZB", 
    "VEN", "YEM", "YPR", "YUG", "ZIM"), class = "factor"), startyear = c(1996, 
    1974, 1992, 2008, 2004, 1965, 1990, 1996, 2000, 1946, 1994, 
    1997, 2009, 1971, 2003, 1957, 1998, 1976), .rows = list(15L, 
        12L, 9:10, 6L, 19L, 5L, 8L, c(4L, 18L), 7L, 16L, 1L, 
        14L, 2L, 3L, 11L, 13L, 20L, 17L)), row.names = c(NA, 
-18L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))
DOR
  • 9
  • 4
  • Not clear about your expecetd. Do you want `nsa %>% group_by(acr) %>% filter(all(startyear < endyear))` – akrun May 22 '20 at 21:40

2 Answers2

1

Maybe this will give you some ideas

check_overlap <- function(start,end){
  map2_dbl(start,end, ~sum((.x >= start & .x <= end) | (.y <= end & .y >= start)  )) 

}
nsa %>% 
  group_by(acr) %>% 
  mutate(overlap = check_overlap(startyear, endyear)) %>% 
  arrange(acr)

Hope this helps!!

Bertil Baron
  • 4,923
  • 1
  • 15
  • 24
  • I think this does the trick! Still trying to wrap my head around what map2_dbl() is doing, but greatly appreciate the help. – DOR May 22 '20 at 23:39
  • map2_dbl is a function from the purrr package. It is similar to sapply but iterates over 2 list or vectors. – Bertil Baron May 23 '20 at 06:14
0

We can try

library(dplyr)
nsa %>% 
   group_by(acr) %>%
   filter(all(startyear < endyear))
akrun
  • 874,273
  • 37
  • 540
  • 662