
I have a data frame like this:

> head(a)
         FID   IID FLASER PLASER DIABDUR HBA1C ESRD   pheno
1 fam1000-03 G1000      1      1      38  10.2    1 control
2 fam1001-03 G1001      1      1      15   7.3    1 control
3 fam1003-03 G1003      1      2      17   7.0    1    case
4 fam1005-03 G1005      1      1      36   7.7    1 control
5 fam1009-03 G1009      1      1      23   7.6    1 control
6 fam1052-03 G1052      1      1      32   7.3    1 control

My data frame has 1698 observations, of which 828 have "case" and 836 have "control" in the pheno column.

I make a histogram via:

library(ggplot2)
ggplot(a, aes(x=HBA1C, fill=pheno)) + 
  geom_histogram(binwidth=.5, position="dodge")

I would like the y-axis to show percentages instead of counts, calculated separately within each pheno group ("case" or "control"), so that each bar gives the share of that group's individuals falling in the bin. My plot also contains NAs in pheno, and I would like to exclude them.

I guess I can remove NAs from pheno with this:

ggplot(data=subset(a, !is.na(pheno)), aes(x=HBA1C, fill=pheno)) + geom_histogram(binwidth=.5, position="dodge")


stefan
anamaria
  • Does this answer your question? [Let ggplot2 histogram show classwise percentages on y axis](https://stackoverflow.com/questions/31200254/let-ggplot2-histogram-show-classwise-percentages-on-y-axis) – stefan May 22 '20 at 07:42
  • I tried that but I got this error: Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 44, 66 – anamaria May 22 '20 at 13:02
  • Hi @anamaria. The reason for the error was that you had more than two groups in your dataset (the NAs being the third group). Have a look at my answer. That is a more general approach which works for any number of groups. – stefan May 23 '20 at 08:09

1 Answer


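You can get per-group percentages by dividing each bin's count by the total count of its own group, as in the code below. This works for any number of groups.

Note: You were right about the NAs. Simply subset to the non-NA values as you did, or use dplyr::filter or ...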
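As a quick check before plotting, it can help to confirm how many groups pheno really contains, since NA counts as a group of its own. The following is a minimal sketch in base R; the `toy` data frame and its values are made up for illustration and only stand in for the real `a`.

```r
# Toy data standing in for the real data frame `a` (values are made up).
toy <- data.frame(
  HBA1C = c(10.2, 7.3, 7.0, 7.7, 7.6, 7.3, 8.1),
  pheno = c("control", "control", "case", "control", "control", "control", NA),
  stringsAsFactors = FALSE
)

# Count rows per group, listing NA as its own group if present.
print(table(toy$pheno, useNA = "ifany"))

# Keep only rows with a known phenotype.
toy_complete <- subset(toy, !is.na(pheno))

# Number of distinct groups that would reach the plot after filtering.
n_groups <- length(unique(toy_complete$pheno))
```

Passing `toy_complete` (or the equivalently filtered `a`) as the `data` argument of `ggplot()` then keeps the NA group out of the histogram.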

a <- read.table(text = "id FID   IID FLASER PLASER DIABDUR HBA1C ESRD   pheno
1 fam1000-03 G1000      1      1      38  10.2    1 control
2 fam1001-03 G1001      1      1      15   7.3    1 control
3 fam1003-03 G1003      1      2      17   7.0    1    case
4 fam1005-03 G1005      1      1      36   7.7    1 control
5 fam1009-03 G1009      1      1      23   7.6    1 control
6 fam1052-03 G1052      1      1      32   7.3    1 control
                7 fam1052-03 G1052      1      1      32   7.3    1 NA", header = TRUE)

library(ggplot2)

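# Divide each bin's count by the total count of its own group, so the
# bars of every group sum to 100%. after_stat() replaces the older
# ..count.. notation, which is deprecated as of ggplot2 3.4.0.
ggplot(a, aes(x = HBA1C, fill = pheno)) +
  geom_histogram(aes(y = after_stat(count / tapply(count, group, sum)[group])),
                 position = "dodge", binwidth = 0.5) +
  scale_y_continuous(labels = scales::percent)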

Created on 2020-05-23 by the reprex package (v0.3.0)
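To see what the `tapply()` expression inside the histogram does, here is a small worked example in base R. The `count` and `group` vectors are hypothetical stand-ins for the per-bin values that the histogram stat computes internally (one entry per bin and group); the numbers are invented.

```r
# Hypothetical per-bin counts: three bins for group 1, three for group 2.
count <- c(2, 6, 2, 1, 3, 4)
group <- c(1, 1, 1, 2, 2, 2)

# Total count per group (10 for group 1, 8 for group 2).
group_totals <- tapply(count, group, sum)

# Indexing the totals by `group` gives every bin its own group's total,
# so the division yields the within-group share of each bin.
share <- as.numeric(count / group_totals[group])
print(share)

# Within each group the shares add up to 1, i.e. 100%.
print(tapply(share, group, sum))
```

Because the division is done per group, a small group and a large group are put on the same scale, which is what makes the dodged bars comparable.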

stefan