
I am looking for advice on better ways to plot the proportion of observations in various categories.

I have a dataframe that looks something like this:

cat1 <- c("high", "low", "high", "high", "high", "low", "low", "low", "high", "low", "low")
cat2 <- c("1-young", "3-old", "2-middle-aged", "3-old", "2-middle-aged", "2-middle-aged", "1-young", "1-young", "3-old", "3-old", "1-young")
df <- as.data.frame(cbind(cat1, cat2))

In the example here, I want to plot the proportion of each age group that have the value "high", and the proportion of each age group that have the value "low". More generally, I want to plot, for each value of category 2, the percent of observations that fall into each of the levels of category 1.

The following code produces the right result, but only by manually counting and dividing before plotting. Is there a good way to do this on the fly within ggplot?

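library(plyr)
library(ggplot2)

# Count observations per (cat1, cat2) pair and per cat2 group
count1 <- count(df, vars = c("cat1", "cat2"))
count2 <- count(df, "cat2")

# Look up each row's cat2 total by name instead of relying on vector recycling
count1$totals <- count2$freq[match(count1$cat2, count2$cat2)]
count1$pct <- count1$freq / count1$totals

# Bar heights are precomputed, so geom_bar needs stat = "identity"
ggplot(data = count1, aes(x = cat2, y = pct)) +
  facet_wrap(~cat1) +
  geom_bar(stat = "identity")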

This previous stackoverflow question offers something similar, with the following code:

ggplot(mydataf, aes(x = foo)) + 
geom_bar(aes(y = (..count..)/sum(..count..)))

But I do not want "sum(..count..)" - which gives the sum of the count of all the bins - in the denominator; rather, I want the sum of the count of each of the "cat2" categories. I have also studied the stat_bin documentation.

I would be grateful for any tips and suggestions on how to make this work.

user1257313
  • In addition to my answer, I'll also point you toward [this](http://stackoverflow.com/a/10888762/324364) answer which might be useful. (But be aware that hacks like that might not survive as ggplot gets updated to subsequent versions.) – joran Jun 14 '12 at 03:53
  • Since that is not a typical summary of data, there is no simple syntax to do it inside of ggplot. Your best approach is to pre-summarize the data, much as you have done. – Brian Diggs Jun 14 '12 at 05:08

2 Answers


I will understand if this isn't really what you're looking for, but I found your description of what you wanted very confusing until I realized that you were simply trying to visualize your data in a way that seemed very unnatural to me.

If someone asked me to produce a graph with the proportions within each category, I'd probably turn to a segmented bar chart:

ggplot(df,aes(x = cat2,fill = cat1)) + 
    geom_bar(position = "fill")


Note the y axis records proportions, not counts, as you wanted.
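If the faceted layout from the question is still preferred, the proportions can also be pre-summarized in base R and then plotted. The following is only a minimal sketch, assuming the example data from the question; the names `props` and `pct_df` are made up for illustration, `geom_col` assumes a reasonably recent ggplot2, and the plotting step is skipped if ggplot2 is not installed.

```r
# Rebuild the example data from the question
cat1 <- c("high", "low", "high", "high", "high", "low", "low", "low", "high", "low", "low")
cat2 <- c("1-young", "3-old", "2-middle-aged", "3-old", "2-middle-aged", "2-middle-aged",
          "1-young", "1-young", "3-old", "3-old", "1-young")
df <- data.frame(cat1, cat2)

# Share of each cat1 level within every cat2 group
# (margin = 1 normalizes over rows, so each cat2 row sums to 1)
props <- prop.table(table(cat2 = df$cat2, cat1 = df$cat1), margin = 1)
print(round(props, 3))

# Long format: one row per (cat2, cat1) pair, proportion in column "pct"
pct_df <- as.data.frame(props, responseName = "pct")

# Plot the precomputed proportions, one facet per cat1 level
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(pct_df, aes(x = cat2, y = pct)) +
    geom_col() +
    facet_wrap(~cat1)
  print(p)
}
```

These are the same within-group proportions that `position = "fill"` draws as stacked segments, so the table is also a quick way to check the segmented bar chart.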

joran
  • thanks, this is an excellent solution. Thanks for providing me with a great alternative to what I was trying to do. – user1257313 Jun 14 '12 at 13:49
  • Thanks, this is a very good but also very simple answer that should help a lot of people who are confusing themselves and overcomplicating things (like me just now). – James Lupolt Mar 22 '15 at 17:35
  • cat1 must be a factor variable, doesn't appear to work if continuous/numeric – Brian D Feb 06 '18 at 21:29
  • @BrianD You'd probably want a `geom_col` or `stat = "identity"` variant in that case. – joran Feb 06 '18 at 21:37
  • @joran I just meant in cases where the factors are numeric. Sometimes people code the groups as 1, 2 or 0, 1 which isn't recognized as a factor and should be converted to factor type for this technique to work. – Brian D Feb 07 '18 at 15:00

This might be a bit late for you, and it does not involve ggplot, but:

I think mosaicplots are the way forward to visualise the interaction of two factors:

cat1 <- c("high", "low", "high", "high", "high", "low", "low", "low", "high", "low", "low")
cat2 <- c("1-young", "3-old", "2-middle-aged", "3-old", "2-middle-aged", "2-middle-aged", "1-young", "1-young", "3-old", "3-old", "1-young")
df <- as.data.frame(cbind(cat1, cat2))

mosaicplot(cat2 ~ cat1, data = df, col = c('lightskyblue2', 'tomato'))

mosaic plot of data with two factors

In this plot, the box for each value pair is scaled according to the number of observations in that category. You can provide a colour vector to aid with visualisation.
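To see the numbers behind the box sizes, the counts can be tabulated first and the table passed to `mosaicplot` directly. This is just a sketch on the same example data; the name `counts` and the plot title are illustrative choices, not part of the original answer.

```r
# Same example data as above
cat1 <- c("high", "low", "high", "high", "high", "low", "low", "low", "high", "low", "low")
cat2 <- c("1-young", "3-old", "2-middle-aged", "3-old", "2-middle-aged", "2-middle-aged",
          "1-young", "1-young", "3-old", "3-old", "1-young")
df <- data.frame(cat1, cat2)

# Contingency table of counts: these are the quantities the boxes are scaled by
counts <- table(cat2 = df$cat2, cat1 = df$cat1)
print(addmargins(counts))

# mosaicplot also accepts a table; 'color' is the full name of the colour argument
mosaicplot(counts, color = c("lightskyblue2", "tomato"), main = "cat1 by cat2")
```

The row totals from `addmargins` correspond to the column widths in the plot, and the cell counts to the split within each column.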

JanLauGe