Count how many observations fall between start, end values that vary by group

Question

How do I create counts of how many observations were active between start and end years grouped by an id variable?

I have a df whose unit of analysis is organizations. These organizations are active for only certain periods of time (startyear, endyear) and in certain countries (acr). Some countries have more than one organization active at the same time. I would like to create a variable that reflects the number of contemporaneously active organizations for each country (for each observation the answer will be at least 1 and could be several dozen). For each observation's startyear, I believe I would need to go over my data frame by country and count how many times the startyear falls between other observations startyear/endyear. How to do it?

I have attempted to do this with dplyr (per Count how many fall in each group in R):

nsa2 <- nsa %>%
  group_by(acr,startyear) %>%
  mutate(count=n())

Is there a way to use filter() or between(), even though the between years are not fixed but change? or even apply to go back over the df?

Here is some sample data with a few obs that overlap:

structure(list(acr = structure(c(8L, 10L, 11L, 7L, 6L, 4L, 7L, 
6L, 3L, 3L, 12L, 2L, 13L, 9L, 1L, 8L, 15L, 7L, 5L, 14L), .Label = c("AFG", 
"ARG", "CHA", "DRC", "IRQ", "ISR", "MYA", "RUS", "RWA", "SOM", 
"SUD", "THI", "UKG", "YUG", "ZIM"), class = "factor"), startyear = c(1994, 
2009, 1971, 1996, 1965, 2008, 2000, 1990, 1992, 1992, 2003, 1974, 
1957, 1997, 1996, 1946, 1976, 1996, 2004, 1998), endyear = c(1996, 
2010, 1971, 1996, 1973, 2008, 2000, 1999, 1994, 1994, 2011, 1977, 
1957, 2002, 1996, 1948, 1979, 1996, 2005, 1999), obsid = structure(c(5L, 
3L, 19L, 6L, 11L, 10L, 20L, 9L, 17L, 18L, 8L, 12L, 14L, 4L, 2L, 
16L, 1L, 13L, 15L, 7L), .Label = c("NSA.3.4-1030", "NSA.3.4-1126", 
"NSA.3.4-1219", "NSA.3.4-1381", "NSA.3.4-1546", "NSA.3.4-157", 
"NSA.3.4-1594", "NSA.3.4-1657", "NSA.3.4-1666", "NSA.3.4-1678", 
"NSA.3.4-271", "NSA.3.4-346", "NSA.3.4-376", "NSA.3.4-391", "NSA.3.4-406", 
"NSA.3.4-58", "NSA.3.4-679", "NSA.3.4-685", "NSA.3.4-892", "NSA.3.4-91"
), class = "factor")), row.names = c(NA, -20L), groups = structure(list(
    acr = structure(c(1L, 4L, 15L, 26L, 43L, 44L, 44L, 59L, 59L, 
    75L, 75L, 77L, 83L, 87L, 91L, 97L, 104L, 105L), .Label = c("AFG", 
    "ALG", "ANG", "ARG", "AZE", "BFO", "BNG", "BOL", "BOS", "BUI", 
    "CAM", "CAO", "CDI", "CEN", "CHA", "CHL", "CHN", "COL", "COM", 
    "CON", "COS", "CRO", "CUB", "DJI", "DOM", "DRC", "EGY", "ERI", 
    "ETH", "FRN", "GAB", "GAM", "GHA", "GNB", "GRC", "GRG", "GUA", 
    "GUI", "HAI", "IND", "INS", "IRN", "IRQ", "ISR", "KEN", "LAO", 
    "LBR", "LEB", "LES", "LIB", "MAA", "MAC", "MAG", "MAL", "MEX", 
    "MLD", "MLI", "MOR", "MYA", "MZM", "NEP", "NIC", "NIG", "NIR", 
    "NTH", "OMA", "PAK", "PAN", "PAR", "PER", "PHI", "PNG", "POR", 
    "RUM", "RUS", "RVN", "RWA", "SAF", "SAL", "SAU", "SEN", "SIE", 
    "SOM", "SPN", "SRI", "SSD", "SUD", "SUR", "SYR", "TAJ", "THI", 
    "TOG", "TRI", "TUN", "TUR", "UGA", "UKG", "URU", "USA", "UZB", 
    "VEN", "YEM", "YPR", "YUG", "ZIM"), class = "factor"), startyear = c(1996, 
    1974, 1992, 2008, 2004, 1965, 1990, 1996, 2000, 1946, 1994, 
    1997, 2009, 1971, 2003, 1957, 1998, 1976), .rows = list(15L, 
        12L, 9:10, 6L, 19L, 5L, 8L, c(4L, 18L), 7L, 16L, 1L, 
        14L, 2L, 3L, 11L, 13L, 20L, 17L)), row.names = c(NA, 
-18L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

Not clear about your expecetd. Do you want `nsa %>% group_by(acr) %>% filter(all(startyear < endyear))` — akrun, May 22 '20 at 21:40

score 1 · Accepted Answer · answered May 22 '20 at 22:03

1

Maybe this will give you some ideas

check_overlap <- function(start,end){
  map2_dbl(start,end, ~sum((.x >= start & .x <= end) | (.y <= end & .y >= start)  )) 

}
nsa %>% 
  group_by(acr) %>% 
  mutate(overlap = check_overlap(startyear, endyear)) %>% 
  arrange(acr)

Hope this helps!!

answered May 22 '20 at 22:03

Bertil Baron

4,923
1
15
24

I think this does the trick! Still trying to wrap my head around what map2_dbl() is doing, but greatly appreciate the help. – DOR May 22 '20 at 23:39
map2_dbl is a function from the purrr package. It is similar to sapply but iterates over 2 list or vectors. – Bertil Baron May 23 '20 at 06:14

score 0 · Answer 2 · answered May 22 '20 at 21:44

0

We can try

library(dplyr)
nsa %>% 
   group_by(acr) %>%
   filter(all(startyear < endyear))

answered May 22 '20 at 21:44

akrun

874,273
37
540
662

Count how many observations fall between start, end values that vary by group

2 Answers2