Good Afternoon,

After several attempts, I cannot get R to sum the data below. As the sample of my data shows, zip code 33024 is listed four times. R keeps reporting that 33024 has only 2 injuries and does not add the other rows to that total. Any help on this?

Edit: This should help as well. The summary below shows the Max for Injury staying at 3 instead of increasing with the number of entries per zip code.

ZipCode         Age        Fatality       Injury        Year   
 33065  : 24   15     :28   Min.   :1     Min.   :1.000   2015:92  
 33313  : 18   18     :27   1st Qu.:1     1st Qu.:1.000   2016:67  
 33317  : 14   13     :21   Median :1     Median :1.000   2017:35  
 33076  : 13   17     :19   Mean   :1     Mean   :1.083            
 33026  : 11   12     :18   3rd Qu.:1     3rd Qu.:1.000            
 33311  : 11   14     :18   Max.   :1     Max.   :3.000

  ZipCode Age Fatality Injury Year
1   33023  17       NA      1 2015
2   33024   6       NA      1 2015
3   33024   8       NA      2 2015
4   33024  13       NA      1 2015
5   33024  13       NA      1 2015
6   33026  14       NA      1 2015

BCD = read.csv(file.choose())
BCD

head(BCD)
tail(BCD)

library(ggplot2)
str(BCD)

colnames(BCD) = c("ZipCode", "Age", "Fatality", "Injury", "Year")
head(BCD)

list(BCD$Injury)
list(BCD$ZipCode)

factor(BCD$Year)
factor(BCD$ZipCode)

BCD$Year= factor(BCD$Year)
BCD$ZipCode= factor(BCD$ZipCode)
BCD$Age = factor(BCD$Age)
BCD$Injury = as.numeric(BCD$Injury)
BCD$Fatality = as.numeric(BCD$Fatality)
str(BCD)
head(BCD)
summary(BCD)


BCD2 = ggplot(data=BCD, aes(x=Injury, y=ZipCode, color=Age, size=Year))
BCD2 + geom_point()+ geom_smooth()

This is my code so far. I am trying to produce a ggplot based on year, age, zip code, and the number of injuries that occurred in that zip code.

  • Sorry, I meant to say that it will not sum the rest of them up with the 2. Also, the fifth column from the left is Injury. Thank you again. – ZeDandyMan56 Jan 22 '20 at 19:35
  • Can you provide the code you used to try to sum up your data? – dc37 Jan 22 '20 at 19:36
  • @ZeDandyMan56, your question is unclear. What are you trying to do? Are you trying to sum up the Injury column by zipcode? – Sheila Jan 22 '20 at 19:36
  • [See here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on making an R question that folks can help with. That includes a sample of data, all necessary code, and a clear explanation of what you're trying to do and what hasn't worked. – camille Jan 22 '20 at 19:48
  • We need more information about your data. Use `str(BCD)` and show us the output. It appears that ZipCode, Age, and Year are factors, not numeric values. You cannot sum factors, but you can tabulate the number of rows that have each factor level. Fatality and Injury are numeric, but they may actually be codes of some kind. – dcarlson Jan 22 '20 at 23:42

1 Answer

The `summary` function reports the largest single value in the `Injury` column, not a sum, and it does not take the grouping by `ZipCode` into account.
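To illustrate the difference, here is a minimal base R sketch. It assumes a small data frame `df` rebuilt by hand from the six sample rows in the question (the `Fatality` column is left out because it is all `NA` there); the names `row_max` and `zip_totals` are only illustrative.

```r
# Six sample rows copied from the question (Fatality omitted: all NA)
df <- data.frame(
  ZipCode = factor(c(33023, 33024, 33024, 33024, 33024, 33026)),
  Age     = c(17, 6, 8, 13, 13, 14),
  Injury  = c(1, 1, 2, 1, 1, 1),
  Year    = 2015
)

# What summary() reports as Max.: the largest value found in any single row
row_max <- max(df$Injury)

# One total per zip code: sum Injury within each level of ZipCode
zip_totals <- tapply(df$Injury, df$ZipCode, sum)

print(row_max)          # largest single row
print(zip_totals)       # named vector of totals, one per zip code
print(max(zip_totals))  # largest per-zip total
```

On this sample the largest single row is 2, while zip code 33024 totals 5 across its four rows, which is the kind of gap described in the question.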

To calculate the cumulative sum of injuries per ZipCode, you need to group by ZipCode (`group_by`) and then apply the `cumsum` function. You can do that with the dplyr package.

library(dplyr)
# df holds the six sample rows from the question
df %>% group_by(ZipCode) %>% 
  mutate(CumSumInjury = cumsum(Injury))  # running total within each ZipCode

# A tibble: 6 x 7
# Groups:   ZipCode [3]
    Row ZipCode   Age Fatality Injury  Year CumSumInjury
  <int>   <int> <int> <lgl>     <int> <int>        <int>
1     1   33023    17 NA            1  2015            1
2     2   33024     6 NA            1  2015            1
3     3   33024     8 NA            2  2015            3
4     4   33024    13 NA            1  2015            4
5     5   33024    13 NA            1  2015            5
6     6   33026    14 NA            1  2015            1

Combining it with ggplot, you can get the following plot:

library(dplyr)
library(ggplot2)
df %>% group_by(ZipCode) %>% 
  mutate(CumSumInjury = cumsum(Injury)) %>%
  ggplot(aes(x = as.factor(ZipCode), y = CumSumInjury, color = Age, size = Year))+
  geom_point()

[Image: scatter plot of CumSumInjury by ZipCode, with point color mapped to Age and point size mapped to Year]
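If what you want is a single total per zip code rather than a running total on every row, a possible variation is to collapse the data with `summarise` before plotting. This is only a sketch under assumptions: `df` is rebuilt by hand from the sample rows in the question, `totals` and `TotalInjury` are made-up names, and `Age` is dropped because it differs between the rows being added together.

```r
library(dplyr)
library(ggplot2)

# Sample rows copied from the question; with the real data this would be BCD
df <- data.frame(
  ZipCode = c(33023, 33024, 33024, 33024, 33024, 33026),
  Age     = c(17, 6, 8, 13, 13, 14),
  Injury  = c(1, 1, 2, 1, 1, 1),
  Year    = 2015
)

# Collapse to one row per ZipCode and Year, adding up the injuries
totals <- df %>%
  group_by(ZipCode, Year) %>%
  summarise(TotalInjury = sum(Injury, na.rm = TRUE), .groups = "drop")

# Bar chart of total injuries per zip code, one bar per year
p <- ggplot(totals, aes(x = factor(ZipCode), y = TotalInjury, fill = factor(Year))) +
  geom_col(position = "dodge") +
  labs(x = "ZipCode", y = "Total injuries", fill = "Year")
p
```

With the sample rows this gives three bars, and the one for 33024 has height 5. Grouping by `Year` as well keeps the years separate once the full data set, which spans 2015 to 2017, is used.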

dc37
  • Yes and no. I do want the duplicate zipcode entries to be summed, but in the results of the ggplot I am attempting. – ZeDandyMan56 Jan 22 '20 at 19:50
  • For instance, when looking at the summary of my data, the max for Injury stays at 3 even though I know the max is much higher. – ZeDandyMan56 Jan 22 '20 at 19:51
  • What do you mean by max? Are you looking for the cumulative sum of injuries per zipcode? – dc37 Jan 22 '20 at 20:05
  • Yes, I have listed it above. If you look at the summary, it only goes up to three and will not increase based on the zipcode. – ZeDandyMan56 Jan 22 '20 at 20:08