1

I have time series panel data in R (organized by country-year) and am trying to add a variable to the data frame that counts the number of observations that equal "1" in a binary variable for each unique year value, and returns that count for each unique year value (so that number should be the same for Country A, 1995 and Country B, 1995).

The binary variable also has NA for some values, which I want to ignore (not replace with "0" as that will adversely affect my data for analysis purposes later).

I have tried using the cumulative sum method described here (Calculate cumsum() while ignoring NA values) but that cumulates within years as well; it does not deliver the same value for same-year values.

x <- c("USA","USA","USA","USA","UK","UK","UK","UK","India","India","India","India","China","China","China","China","Mexico","Mexico","Mexico","Mexico","Canada","Canada","Canada","Canada")
y <- c(1995,1996,1997,1998,1995,1996,1997,1998,1995,1996,1997,1998,1995,1996,1997,1998,1995,1996,1997,1998,1995,1996,1997,1998)
z <- c(0,0,0,0,0,0,1,1,0,1,1,1,NA,NA,NA,NA,0,0,0,1,NA,NA,0,1)
x_name <- "country"
y_name <- "year"
z_name <- "status"

df <- data.frame(x,y,z)
names(df) <- c(x_name,y_name,z_name)
print(df)

df <- df %>%
group_by(year) %>%
mutate(total_status = cumsum(coalesce(status, 0)) + status*0) %>% 
ungroup()

As described, the code above produces this output, which is not what I want.

country year    status  total_status
USA     1995    0       0
USA     1996    0       0
USA     1997    0       0
USA     1998    0       0
UK      1995    0       0
UK      1996    0       0
UK      1997    1       1
UK      1998    1       2
India   1995    0       2
India   1996    1       3
India   1997    1       4
India   1998    1       5
China   1995    NA      NA
China   1996    NA      NA
China   1997    NA      NA
China   1998    NA      NA
Mexico  1995    0       5
Mexico  1996    0       5
Mexico  1997    0       5
Mexico  1998    1       6
Canada  1995    NA      NA
Canada  1996    NA      NA
Canada  1997    0       6
Canada  1998    1       7

The expected output should be:

   country year status total_status
1      USA 1995      0 0
2      USA 1996      0 1
3      USA 1997      0 2
4      USA 1998      0 4
5       UK 1995      0 0
6       UK 1996      0 1
7       UK 1997      1 2
8       UK 1998      1 4
9    India 1995      0 0
10   India 1996      1 1
11   India 1997      1 2
12   India 1998      1 4
13   China 1995     NA 0
14   China 1996     NA 1
15   China 1997     NA 2
16   China 1998     NA 4
17  Mexico 1995      0 0
18  Mexico 1996      0 1
19  Mexico 1997      0 2
20  Mexico 1998      1 4
21  Canada 1995     NA 0
22  Canada 1996     NA 1
23  Canada 1997      0 2
24  Canada 1998      1 4

Is there a way to do this, preferably using dplyr? Thank you to whomever can help!

Abe
  • 393
  • 2
  • 13
  • 1
    That is correct. If you look down through the rows for like-years, there are no values of "1" in all the 1995 observations, there is one value of "1" in all the 1996 observations, two values of "1" in 1997, and four values of "1" in 1998. That is exactly what I'm trying to produce (albeit in a much larger scale in my real data). You'll see the UK, India, Mexico, and Canada all have "1" in 1998. – Abe Jan 17 '19 at 07:45

1 Answers1

2

I think it's just a sum() issue:

df %>%
  group_by(year) %>%
  mutate(total_status = sum(status, na.rm = T))

#    country year status total_status
# 1      USA 1995      0            0
# 2      USA 1996      0            1
# 3      USA 1997      0            2
# 4      USA 1998      0            4
# 5       UK 1995      0            0
# 6       UK 1996      0            1
# 7       UK 1997      1            2
# 8       UK 1998      1            4
# 9    India 1995      0            0
# 10   India 1996      1            1
# 11   India 1997      1            2
# 12   India 1998      1            4
# 13   China 1995     NA            0
# 14   China 1996     NA            1
# 15   China 1997     NA            2
# 16   China 1998     NA            4
# 17  Mexico 1995      0            0
# 18  Mexico 1996      0            1
# 19  Mexico 1997      0            2
# 20  Mexico 1998      1            4
# 21  Canada 1995     NA            0
# 22  Canada 1996     NA            1
# 23  Canada 1997      0            2
# 24  Canada 1998      1            4

Edit

Why do the group_by() and mutate() not work? It may be relevant to this issue: Why are my dplyr group_by & summarize not working properly? (name-collision with plyr) . To sum up, certain packages interfere with the dplyr such as plyr. You can change your command to dplyr::command() to call functions in dplyr directly. In this case it will be:

df %>%
  dplyr::group_by(year) %>%
  dplyr::mutate(total_status = sum(status, na.rm = T))

This way seems a little complicated. You can unload the interferential packages by detach(package:plyr).

Darren Tsai
  • 32,117
  • 5
  • 21
  • 51
  • Unfortunately, that does not work. It merely produces a variable with "7" as the value in every observation (the total number of "1" values across all years). – Abe Jan 17 '19 at 08:00
  • @AbeBarranca I use the reproducible data you provide in the question and the output I show is the same as what you expect.(The output I show above is the output my computer prints) – Darren Tsai Jan 17 '19 at 08:05
  • @AbeBarranca check my edit. The data I show is what you provide, right? – Darren Tsai Jan 17 '19 at 08:12
  • Darren, I'm not sure how that is. I reran what you posted and still got a variable of all "7"s. See below. `> df2 <- df %>%` `+ group_by(year) %>% mutate(total_status = sum(status, na.rm = T))` `> df2` `# A tibble: 24 x 4` `# Groups: year [4]` ` country year status total_status` ` ` ` 1 USA 1995 0 7` ` 2 USA 1996 0 7` ` 3 USA 1997 0 7` ` 4 USA 1998 0 7` ` 5 UK 1995 0 7` ` 6 UK 1996 0 7` – Abe Jan 17 '19 at 08:18
  • Yes, that data reproduction looks accurate. I'm still getting the same result of a vector of "7"s, as I said. Sorry about the last comment; I accidentally left a space inside of the mini markdown separators and it won't let me go back in and correct (because of the five-minute comment rule). – Abe Jan 17 '19 at 08:26
  • @AbeBarranca try this: Try this: `df %>% dplyr::group_by(year) %>% dplyr::mutate(total_status = sum(status, na.rm = T))` – Darren Tsai Jan 17 '19 at 08:47
  • That worked perfectly, thank you! Can you explain the difference produced by dplyr::command? – Abe Jan 17 '19 at 08:48
  • Interesting. So it may be the result of a conflict with plyr? (I did have it loaded too.) Just so I'm understanding: using the dplyr::command syntax calls dplyr directly in a way that does not cause this interference? Also, if you edit your response above with the correct, I'd be glad to up-vote it! – Abe Jan 17 '19 at 08:57