I would like to use geom_bar with facets and show percentages instead of absolute counts, where the percentages are relative to each facet rather than to the overall count.

This has been discussed a lot (example), with the suggestion to use geom_bar(aes(y = (..count..)/sum(..count..))). This won't work with facets, because the proportions are computed relative to the total count across all facets. A better solution has been suggested, using stat_count(mapping = aes(x=x_val, y=..prop..)) instead.

This seems to work if x is numeric, but not if x is character: all bars are 100%! Why? Am I doing something wrong? Thanks!

library(tidyverse)
df <- tibble(val_num = c(rep(1, 60), rep(2, 40), rep(1, 30), rep(2, 70)),
             val_cat = ifelse(val_num == 1, "cat", "mouse"),
             group = rep(c("A", "B"), each = 100))

#works with numeric 
ggplot(df) + stat_count(mapping = aes(x=val_num, y=..prop..)) + facet_grid(group~.)

# does not work? 
ggplot(df) + stat_count(mapping = aes(x=val_cat, y=..prop..)) + facet_grid(group~.)
Matifou
  • If your x-axis is category, you're essentially asking "which percentage of cats are cats and which percentage of mice are mice?" – lebelinoz Oct 05 '17 at 22:37
  • I think the question is within group/facet A, what's the proportion of cats (or values 1) vs mice (value 2). And same for group B, no? But yes, maybe my question is ill-posed? I still don't see why the behaviour is different for numeric than character? – Matifou Oct 05 '17 at 22:42

1 Answer

Adding group=group tells ggplot to calculate proportions by group, rather than the default, which would be separately for each level of val_cat.

ggplot(df) + 
  stat_count(aes(x=val_cat, y=..prop.., group=group)) + 
  facet_grid(group~.)
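
As a quick sanity check on what the plot should show, here is a minimal sketch that computes the per-facet proportions by hand with dplyr. It rebuilds the question's data with base data.frame() (my choice, so the snippet stands alone), and the name props is just an illustrative variable, not something from the original post.

```r
library(dplyr)

# Rebuild the example data from the question (base data.frame, for self-containment)
val_num <- c(rep(1, 60), rep(2, 40), rep(1, 30), rep(2, 70))
df <- data.frame(
  val_num = val_num,
  val_cat = ifelse(val_num == 1, "cat", "mouse"),
  group   = rep(c("A", "B"), each = 100)
)

# Count each category within each facet variable level,
# then divide by the facet total to get a within-facet proportion
props <- df %>%
  count(group, val_cat) %>%
  group_by(group) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

props
```

The bar heights in the faceted plot with group=group should match the prop column: each facet's bars sum to 1 rather than every bar being 100%.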

When the x-variable is continuous, it looks like stat_count by default calculates percentages over all data in the facet. However, when the x-variable is categorical, stat_count calculates percentages separately within each x level. See what happens with the following examples:

Adding val_num as the group aesthetic causes percentages to be calculated within each x level instead of over all values in a facet.

ggplot(df) + 
  stat_count(aes(x=val_num, y=..prop.., group=val_num)) + 
  facet_grid(group~.)

Turning val_num into a factor likewise causes percentages to be calculated within each x level instead of over all values in a facet.

ggplot(df) + 
  stat_count(aes(x=factor(val_num), y=..prop..)) + 
  facet_grid(group~.)
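
For readers on a newer ggplot2 (3.3.0 or later, as far as I know), the ..prop.. notation can be written as after_stat(prop). The following is a sketch of the group=group fix in that style, not part of the original answer. The percent axis labels via scales::percent are an optional extra I added, and the data is rebuilt with base data.frame() so the snippet stands alone.

```r
library(ggplot2)

# Rebuild the example data from the question
val_num <- c(rep(1, 60), rep(2, 40), rep(1, 30), rep(2, 70))
df <- data.frame(
  val_num = val_num,
  val_cat = ifelse(val_num == 1, "cat", "mouse"),
  group   = rep(c("A", "B"), each = 100)
)

# group = group makes stat_count compute prop over all rows of each facet,
# instead of separately within each level of val_cat
p <- ggplot(df) +
  stat_count(aes(x = val_cat, y = after_stat(prop), group = group)) +
  scale_y_continuous(labels = scales::percent) +  # optional: show 0.6 as 60%
  facet_grid(group ~ .)

p
```

Calling layer_data(p) is a handy way to inspect the computed prop values without reading them off the plot.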
eipi10
  • great, well spotted! it's interesting to note that you need to specify the `group` for character values, not for numeric ones. – Matifou Oct 05 '17 at 22:45
  • Yes. Character values represent separate categories and are by default therefore treated as separate groups. On the other hand, numeric values typically represent measurements of some population or process, or time and are treated as elements of a single group. That's why `geom_line` connects points across x values when the x-axis is numeric (all the measurements are within a single group) but not when the x-axis is categorical (each x-axis value is a different group). Same for why color and fill is a gradient for numeric values but separate colors for categorical values. – eipi10 Jan 12 '20 at 18:01