I have time series panel data in R (organized by country-year) and am trying to add a variable to the data frame that counts the number of observations that equal "1" in a binary variable for each unique year value, and returns that count for each unique year value (so that number should be the same for Country A, 1995 and Country B, 1995).
The binary variable also has NA for some values, which I want to ignore (not replace with "0" as that will adversely affect my data for analysis purposes later).
I have tried using the cumulative sum method described here (Calculate cumsum() while ignoring NA values) but that cumulates within years as well; it does not deliver the same value for same-year values.
x <- c("USA","USA","USA","USA","UK","UK","UK","UK","India","India","India","India","China","China","China","China","Mexico","Mexico","Mexico","Mexico","Canada","Canada","Canada","Canada")
y <- c(1995,1996,1997,1998,1995,1996,1997,1998,1995,1996,1997,1998,1995,1996,1997,1998,1995,1996,1997,1998,1995,1996,1997,1998)
z <- c(0,0,0,0,0,0,1,1,0,1,1,1,NA,NA,NA,NA,0,0,0,1,NA,NA,0,1)
x_name <- "country"
y_name <- "year"
z_name <- "status"
df <- data.frame(x,y,z)
names(df) <- c(x_name,y_name,z_name)
print(df)
df <- df %>%
group_by(year) %>%
mutate(total_status = cumsum(coalesce(status, 0)) + status*0) %>%
ungroup()
As described, the code above produces this output, which is not what I want.
country year status total_status
USA 1995 0 0
USA 1996 0 0
USA 1997 0 0
USA 1998 0 0
UK 1995 0 0
UK 1996 0 0
UK 1997 1 1
UK 1998 1 2
India 1995 0 2
India 1996 1 3
India 1997 1 4
India 1998 1 5
China 1995 NA NA
China 1996 NA NA
China 1997 NA NA
China 1998 NA NA
Mexico 1995 0 5
Mexico 1996 0 5
Mexico 1997 0 5
Mexico 1998 1 6
Canada 1995 NA NA
Canada 1996 NA NA
Canada 1997 0 6
Canada 1998 1 7
The expected output should be:
country year status total_status
1 USA 1995 0 0
2 USA 1996 0 1
3 USA 1997 0 2
4 USA 1998 0 4
5 UK 1995 0 0
6 UK 1996 0 1
7 UK 1997 1 2
8 UK 1998 1 4
9 India 1995 0 0
10 India 1996 1 1
11 India 1997 1 2
12 India 1998 1 4
13 China 1995 NA 0
14 China 1996 NA 1
15 China 1997 NA 2
16 China 1998 NA 4
17 Mexico 1995 0 0
18 Mexico 1996 0 1
19 Mexico 1997 0 2
20 Mexico 1998 1 4
21 Canada 1995 NA 0
22 Canada 1996 NA 1
23 Canada 1997 0 2
24 Canada 1998 1 4
Is there a way to do this, preferably using dplyr? Thank you to whomever can help!