
I have data on the people involved in a train crash: whether each one survived, and their age.

For example:

file <- data.frame(
        Survived = sample(0:1, 100, replace=TRUE),
        Age = sample(0:100, 100, replace=TRUE))

I would like to create a histogram in R where each bin shows the people who died as a percentage of all the people in the data set whose ages fall within that bin's range.

Here is what I have so far:

hist(file[which(file$Survived==1),]$Age, freq=FALSE)

But this only returns a histogram in which each bar is scaled against the whole data set, like so:

[Figure: Histogram of Sample Data]

I need a percentage within each age group, so that if all the people aged 0-10 died, the histogram bar for that age group would be at 100%.

Stedy
  • Check out the `hist()` function with the `freq` parameter set to `FALSE`: `hist(yourvariable, freq=FALSE)` – Bea May 29 '17 at 20:02
  • I know how to get percentages of the whole data set. I am looking for the percentage of the data contained in each bin. – ElkanaTheGreat May 29 '17 at 20:36
  • For example, the number of people aged 20-40 who died divided by the number of people in the data set who are aged 20-40. – ElkanaTheGreat May 29 '17 at 20:37
  • Please include a reproducible example in your post: https://stackoverflow.com/help/mcve – Bea May 29 '17 at 20:39
  • I added something, but I am not sure exactly what you need. Thanks for your help, I really appreciate it. – ElkanaTheGreat May 29 '17 at 21:05
  • to add percentages you need to set `freq=FALSE` in your `hist()` call – Bea May 29 '17 at 21:07
  • edited to address this – ElkanaTheGreat May 29 '17 at 21:43
  • You will find this much easier if you use a package, specifically `dplyr` and `ggplot2`. See for example this very similar problem: https://stackoverflow.com/questions/41030350/multi-group-histogram-with-group-specific-frequencies – neilfws May 29 '17 at 22:58

1 Answer


I am not sure I understood your data correctly, but here is one possibility using the `barplot` function:

#example data    
AGE<-c(rep("<20",6),rep("20-40",6),rep("40-60",9))
set.seed(123)
SURVIVED<-sample(c(0,1), replace=TRUE, size=21)
df<-data.frame(AGE,SURVIVED)

#output of the data
df
     AGE SURVIVED
1    <20        0
2    <20        1
3    <20        0
4    <20        1
5    <20        1
6    <20        0
7  20-40        1
8  20-40        1
9  20-40        1
10 20-40        0
11 20-40        1
12 20-40        0
13 40-60        1
14 40-60        1
15 40-60        0
16 40-60        1
17 40-60        0
18 40-60        0
19 40-60        0
20 40-60        1
21 40-60        1

#the actual code
barplot(prop.table(table(df$SURVIVED,df$AGE), margin =2)[2,])

#and the proportions per group
> prop.table(table(df$SURVIVED,df$AGE), margin =2)

          <20     20-40     40-60
  0 0.5000000 0.3333333 0.4444444
  1 0.5000000 0.6666667 0.5555556

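`table` cross-tabulates the counts of each SURVIVED value per age group, and `prop.table` with `margin = 2` turns each column into proportions within that age group. Indexing with `[2,]` keeps the SURVIVED==1 row, which is what `barplot` draws.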
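If the ages are numeric, as in the question's sample data, the grouping step can be done with `cut()` before computing the per-group proportion. The following is only a sketch under some assumptions that are not in the original post: bins are 10 years wide, `Survived == 0` means the person died, and the seed is arbitrary.

```r
# Sample data in the same shape as the question's (seed chosen arbitrarily)
set.seed(42)
file <- data.frame(
  Survived = sample(0:1, 100, replace = TRUE),
  Age      = sample(0:100, 100, replace = TRUE)
)

# Assumed bin width of 10 years: [0,10), [10,20), ..., [90,100]
breaks  <- seq(0, 100, by = 10)
age_bin <- cut(file$Age, breaks = breaks, right = FALSE, include.lowest = TRUE)

# The mean of a logical vector is the share of TRUE values, so this is the
# percentage of each age bin that died (assuming Survived == 0 means died)
pct_died <- 100 * tapply(file$Survived == 0, age_bin, mean)

# One bar per age bin, on a fixed 0-100% scale
barplot(pct_died, ylim = c(0, 100), cex.names = 0.7,
        xlab = "Age group", ylab = "Died (% of age group)")
```

A bin in which everyone died reaches 100%, independent of how many people are in it. A bin with no people at all yields `NA` and is drawn without a bar.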


Is that close to what you were looking for?

Bea
  • You need to group your data into categories. – Bea May 29 '17 at 21:45
  • Is there no cleaner way of doing it? – ElkanaTheGreat May 29 '17 at 21:49
  • Wait, sorry, I just counted the data you provided; this is not what I need at all. These are percentages of the whole data set, not of the particular age ranges. – ElkanaTheGreat May 29 '17 at 21:54
  • Indeed, it's getting late. I have updated the answer. What you have now is the proportion within each group. – Bea May 29 '17 at 22:11
  • OK, thank you for your help. But I want to wait and see if anyone else can do it without transforming to age groups. – ElkanaTheGreat May 29 '17 at 22:24
  • I'm pretty sure you'll have to group your data into bins somehow, as GyB has mentioned. If you're just concerned about modifying your data, it should be pretty easy. If your columns are 'ages' and 'survived', split them into groups with something like `splitlist<-split(df, cut(df$ages, seq(0,max(df$ages), by = 20)))` and then barplot that as @GyB suggests: `barplot(sapply(splitlist, function(x) 100*sum(x[,"survived"])/nrow(x)))` – Luke C May 29 '17 at 23:51