-2

I am working with data set with instances of events (like road-accident, gun violence), with details about events in column.

I want to calculate number of event per million population for each state, for that i have important population per state data.

I have change state as factor and counted the number of event per state:

countbystate <- data1 %>% 
  group_by(state) %>% 
  summarise(count=n())

However, i am unable to generate a object with rate per 1 million population. I tried joining two data state by state hoping calculation world work but i have no luck far.

I am sure this might be quite simple for somebody who is using R since long but i am having hard time calculating this rate metric which i need for visualization.

Thanks in Advance.

Gitesh
  • 13
  • 2
  • 4
    Welcome to Stack Overflow. It's easier to help you if you make your question reproducible including minimal input data and expected output that can be used to test and verify possible solutions Check out [mre] and [ask] – Peter Jun 21 '20 at 15:55

2 Answers2

4

To answer this question one needs to merge a source of U.S. state level population data with the data frame containing the event by state data. Fortunately, the United States taxpayers have paid to have this information collected by funding of the U.S. Census Bureau.

We will use the 2019 U.S. population estimates from the U.S. Census Bureau that are included in a data set of state level population estimates from 2010 - 2019, merge them with yesterday's COVID-19 stats courtesy of our friends at the Johns Hopkins University Center for Systems Science and Engineering, and calculate confirmed cases & death rates per million population.

First, we download and load the US population data, which looks like this.

enter image description here

popData <- "https://www2.census.gov/programs-surveys/popest/tables/2010-2019/state/totals/nst-est2019-01.xlsx"

download.file(popData,
              "./data/nst-est2019-01.xlsx",mode="wb")

Next we load the population data, assign column names and keep the July 2019 estimate. We skip the first few rows of data that are summary rows for the total U.S. and aggregated regions of states, as well as the last row which contains data for Puerto Rico.

library(readxl)
colNames <- c("state","census2010","est_base",paste0("est_",2010:2019))
usPopEst <- read_excel("./data/nst-est2019-01.xlsx",
                       range="A10:M60",
                       col_names = colNames)[c(1,13)]
usPopEst$state <- gsub("\\.","",usPopEst$state)

Now, we read the COVID-19 data as of June 20th, 2020 from my forked version of the JHU CCSE COVID-19 Github repository.

# read covid-19 data for June 20th
covidFile <- "https://raw.githubusercontent.com/lgreski/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/06-20-2020.csv"
covidData <- read.csv(covidFile,
                      header=TRUE)

Since the COVID-19 data is at a county / zip code level, we need to aggregate up to the state level. We'll then merge it with the population estimates and calculate incidence rates per million population.

# aggregate stats to state level
library(dplyr)
covidData %>% filter(Country_Region == "US") %>%
     rename(state = Province_State) %>%
     group_by(state) %>% 
     summarise(Confirmed = sum(Confirmed),
               Deaths = sum(Deaths)) %>% 
     inner_join(.,usPopEst) %>%
     mutate(confirmed_per_million = Confirmed / (est_2019 / 1000000),
            death_per_million = Deaths / (est_2019 / 1000000)) -> summedStates

head(summedStates)

...and the output:

> head(summedStates)
# A tibble: 6 x 6
  state      Confirmed Deaths est_2019 confirmed_per_million death_per_million
  <chr>          <int>  <int>    <dbl>                 <dbl>             <dbl>
1 Alabama        29549    838  4903185                 6026.             171. 
2 Alaska           741     12   731545                 1013.              16.4
3 Arizona        50127   1346  7278717                 6887.             185. 
4 Arkansas       15142    224  3017804                 5018.              74.2
5 California    175213   5494 39512223                 4434.             139. 
6 Colorado       30333   1647  5758736                 5267.             286. 
> 
Len Greski
  • 10,505
  • 2
  • 22
  • 33
0

The above code will count the number of states which is not what you desired. To count some instance according to another factor you have to group that instance as well.

Something like this should work:

countbystate <- data1 %>% 
  group_by(state, road_accident) %>% 
  summarise(count=n())
  • Hi Harish, Each row is one instance, so the above code is giving me correct count of instance of in each state, however my question is how do i calculate rate per million population using population data from other source. – Gitesh Jun 21 '20 at 16:33
  • If you have the data source, you can read it and use **join** functions to create a table that has data about both the instances like **road_accident**, **gun_violence** and **population** and calculate the desired results accordingly. [link] https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right/1300618#1300618 will explain you about using join functions – Harish Kumar Jun 21 '20 at 18:47