dplyr not grouping correctly or else using data from previous groups

Question

I am working with JHU data on coronavirus infections, and I'm trying to compute new cases (and deaths) by group. Here's the code:

library(tidyr)  # gather()
library(dplyr)  # %>%, group_by(), arrange(), mutate(), lag()

base <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-"
world.confirmed <- read.csv(paste0(base,"Confirmed.csv"), sep=',', header=TRUE)
world.confirmed <- gather(world.confirmed, Date, Cases, X1.22.20:X3.21.20)

world.deaths <- read.csv(paste0(base,"Deaths.csv"), sep=',', header=TRUE)
world.deaths <- gather(world.deaths, Date, Deaths, X1.22.20:X3.21.20)

world.data <- merge(world.confirmed, world.deaths, 
                 by=c("Province.State","Country.Region","Lat", "Long", "Date"))

world.data$Date <- as.Date(world.data$Date, "X%m.%d.%y")
world.data <- world.data %>% 
    group_by(Province.State,Country.Region,Date) %>%
    arrange(Province.State, Country.Region, as.Date(Date))

Following the solutions to this question on SO, I have tried to compute differences by group using something like this:

world.data <- world.data %>% 
   group_by(Lat,Long) %>% 
   mutate(New.Cases = Cases - lag(Cases))

That does not work, however, and neither does any other grouping. Here are the results at the boundary between the first two countries:

I have also tried inserting an arrange step, and even zeroing the first element of each group. Same problem. Any ideas?

Update: I'm using R 3.4.4 and dplyr_0.8.5

Hey, when I ran your code, line 61 with Albania gives me NA, so did something get funky in your R session? I am on R 3.6.1, tidyr_1.0.0, dplyr_0.8.3 – StupidWolf Mar 22 '20 at 13:11

score 1 · Answer 1 · answered Mar 22 '20 at 12:53


This might help:

library(dplyr)

world.data %>%
  mutate(Date = as.Date(Date, "X%m.%d.%y")) %>% 
  arrange(Country.Region, Lat, Long, Date) %>%
  group_by(Country.Region, Lat, Long) %>%
  mutate(New_Cases = Cases - lag(Cases), 
         New_deaths = Deaths - lag(Deaths))

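We arrange the data by Date within each location and group by Country.Region, Lat and Long. Within each group, New_Cases is the current day's cumulative Cases minus the previous day's, and New_deaths is computed the same way from Deaths. Because lag() has no previous value for the first row of a group, that row is NA rather than a difference against the preceding group.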
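To check the grouped behaviour independently of the JHU files, here is a minimal sketch on a made-up data frame. The name `toy`, its two locations and its counts are invented for illustration, and the `default = first(Cases)` variant is only one possible way to get 0 instead of NA on the first day of each group.

```r
library(dplyr)

# Toy cumulative counts for two hypothetical locations (not JHU data)
toy <- data.frame(
  Country.Region = c("A", "A", "A", "B", "B", "B"),
  Date  = as.Date(c("2020-03-01", "2020-03-02", "2020-03-03",
                    "2020-03-01", "2020-03-02", "2020-03-03")),
  Cases = c(1, 3, 6, 10, 10, 15)
)

toy_new <- toy %>%
  arrange(Country.Region, Date) %>%
  group_by(Country.Region) %>%
  mutate(
    # dplyr::lag() is named explicitly so that stats::lag() cannot be used by mistake
    New_Cases = Cases - dplyr::lag(Cases),
    # Variant: the first row of each group becomes 0 instead of NA
    New_Cases0 = Cases - dplyr::lag(Cases, default = dplyr::first(Cases))
  ) %>%
  ungroup()

print(toy_new)
```

If the first row of location B shows NA in `New_Cases` (and 0 in `New_Cases0`) rather than 10 - 6 = 4, the lag is being applied per group as intended.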

answered Mar 22 '20 at 12:53

Ronak Shah


Not really. Dates were already formatted. The only difference I see is that you're grouping and arranging using more columns; anyway, that does not work either (and I don't really see why it should, except you're sorting by date, but as in the example, sorting was already taken care of) – jjmerelo Mar 22 '20 at 13:05
@jjmerelo It would be good to know what you mean by "working". Can you show what your expected output would look like? If it's difficult to share it for the original data, please create a small reproducible example and show the output for that example. – Ronak Shah Mar 22 '20 at 13:42
I got the same result as above. The problem, as indicated in the comment, was the R version. – jjmerelo Mar 22 '20 at 17:14
