I acquired the US coronavirus data set from The New York Times, which includes the date and the cumulative cases up to that date. How can I extract and plot new cases per day using ggplot in R?

The data set: https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv

Nha Binh Chang
  • What code have you tried so far? Can you provide a minimal reproducible example of what the data looks like? A sketch of the graph you are looking for? – dc37 Apr 06 '20 at 03:23
  • They provide data including a date column (y-m-d), which I have already converted to class Date, the state name, the state code (FIPS), and the cumulative cases and deaths recorded up to that date. For example, up to April 4 there are ~114000 cases in New York. I want to use ggplot with geom_bar to create a bar plot of new cases by day, but I don't know how to calculate that value efficiently. – Nha Binh Chang Apr 06 '20 at 03:35
  • Maybe before starting with such a complex dataset you should generate a smaller and simpler dataset that you can provide in your question, so that we can assist you with that. Without data, I can't guess what code is required for your question. – dc37 Apr 06 '20 at 03:40
  • Got it! I have edited my post and added the link to the raw dataset. Sorry I'm quite new to the platform so my question is a little lacking here and there ^^ – Nha Binh Chang Apr 06 '20 at 03:49
  • As your dataset has several thousand lines, it's not really nice to ask people trying to help you to download it. Instead, you should read this guide on how to provide a reproducible example, https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example, and provide just the first lines of your dataset by pasting the output of `dput(head(df))` into your question. – dc37 Apr 06 '20 at 03:57
  • ok I got it! Thank you – Nha Binh Chang Apr 06 '20 at 04:33

1 Answer

Assuming you have only two columns, one for dates and one for cumulative cases, you can get the number of new cases per day by subtracting the previous day's cumulative value from the current day's.

In dplyr, you can use the lag function for that.

Here is a fake, reproducible dataset (I intentionally keep the original case values so that the calculation can be checked against them):

library(lubridate)  # provides ymd()

df <- data.frame(date = seq(ymd("2020-01-01"),ymd("2020-01-10"),by = "day"),
                 cases = sample(10:100,10))  # random, so your values will differ
df$cumCase <- cumsum(df$cases)

library(dplyr)

df %>% mutate(Orig_cases = ifelse(row_number()==1, cumCase, cumCase - lag(cumCase)))

         date cases cumCase Orig_cases
1  2020-01-01    88      88         88
2  2020-01-02    49     137         49
3  2020-01-03    14     151         14
4  2020-01-04    35     186         35
5  2020-01-05    67     253         67
6  2020-01-06    23     276         23
7  2020-01-07    95     371         95
8  2020-01-08    63     434         63
9  2020-01-09    17     451         17
10 2020-01-10    90     541         90

Now that you have the correct calculation, you can pass it to ggplot:

library(dplyr)
library(ggplot2)

df %>% mutate(Orig_cases = ifelse(row_number()==1, cumCase, cumCase - lag(cumCase))) %>% 
  ggplot(aes(x = date, y = Orig_cases))+
  geom_col()+
  geom_line(aes(y = cumCase, group  = 1))
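
The example above assumes a single series. The New York Times file has one row per state per day, so the difference probably needs to be taken within each state. Here is a minimal sketch of that, assuming the columns are named `date`, `state` and `cases` (cumulative) as described in the comments; the toy values and the names `covid`, `daily` and `new_cases` are made up for illustration:

```r
library(dplyr)
library(ggplot2)

# Toy data laid out like the description of us-states.csv (values are invented).
# For the real file, something like read.csv(<the URL from the question>)
# followed by as.Date() on the date column should give the same shape.
covid <- data.frame(
  date  = as.Date(rep(c("2020-03-01", "2020-03-02", "2020-03-03"), times = 2)),
  state = rep(c("New York", "Washington"), each = 3),
  cases = c(1, 3, 10, 5, 9, 12)   # cumulative counts per state
)

daily <- covid %>%
  group_by(state) %>%                      # difference within each state only
  arrange(date, .by_group = TRUE) %>%      # make sure days are in order
  mutate(new_cases = cases - lag(cases, default = 0)) %>%
  ungroup()

# One panel of daily bars per state
p <- ggplot(daily, aes(x = date, y = new_cases)) +
  geom_col() +
  facet_wrap(~ state)
p
```

With `default = 0`, the first recorded day of each state counts its whole cumulative total as new cases, which matches the `row_number()==1` branch above. If a cumulative count is ever revised downward, the difference for that day will be negative.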


dc37