
I am learning R through Wickham's book, and there is something I do not completely understand. Here is the code:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

What I think happens here is that geom_bar groups by x (= cut), i.e. creates one group per level, and the built-in stat_count in geom_bar counts the number of rows in each level of x. To get the proportion, we have to use prop, and we access it with after_stat because it is a variable computed by stat_count. However, after_stat(prop) takes the count that stat_count outputs for each level and divides it by itself (and NOT by the sum of counts across ALL levels). As a result, we just get bars with height 1. So far so good.

The apparent solution to the problem is this:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

With this code, I get the correct heights, which most likely means that the count for each level of x is now divided by the sum of counts across all levels, and not just by the count of that level itself.
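One way to check this interpretation is to look at the values the stat computes rather than at the rendered bars. The following is a minimal sketch, assuming ggplot2 3.3.0 or later (for after_stat()); the object names p1, p2, d1 and d2 are introduced here only for illustration. It uses layer_data() to pull out the data frame behind each bar layer.

```r
library(ggplot2)

# Plot 1: no explicit group, so each level of cut ends up in its own group.
p1 <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

# Plot 2: group = 1 puts every row into one and the same group.
p2 <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

# layer_data() returns the data computed for a layer (here: layer 1, the bars).
d1 <- layer_data(p1, 1)
d2 <- layer_data(p2, 1)

# Compare the group, count and prop columns of the two plots.
print(d1[, c("x", "group", "count", "prop")])
print(d2[, c("x", "group", "count", "prop")])
```

In d1 the group column takes a different value for every bar and prop is 1 throughout, while in d2 every bar has the same group and the prop values sum to 1.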

Now, I have seen this post. However, it doesn't explain what EXACTLY happens here, in chronological order, once group = 1 is added.

What exactly does R do differently now, and in which step?

Thanks for explaining.

Ethan
Zius
  • What do you mean by "chronological" here? When you say R behaves differently now, what are you comparing it to? The answer there seems to explain what `group=` does. ggplot doesn't make any guarantees about *how* this happens. It's an implementation detail. What do you hope to accomplish by knowing which "step" it happens in? Most of the heavy lifting is done in the [stat_count source code](https://github.com/tidyverse/ggplot2/blob/main/R/stat-count.R#L73) – MrFlick Apr 18 '23 at 21:26
  • I think it is important to understand how the code operates so that I can use these elements flexibly in other contexts as well. I see the difference in the result, and I also get that group = 1 made that difference. What I don't understand is: how does R know from group = 1 that it now has to divide by the sum over ALL levels and not only by the count of a single level of x (as it did previously)? What does the group argument do when it is given an integer? – Zius Apr 18 '23 at 21:34
  • `group=1` is the same as `group="apple"` or `group=99`. All you are doing is assigning every observation to the same group since the value of group is not dependent on any data column. The proportions are calculated per group. By default, if you don't specify group=, it will use the values of `x=` as the group. Setting `group=1` overrides this and makes all observations in the same group. But I feel like I'm just repeating what the accepted answer said on the existing question. – MrFlick Apr 18 '23 at 21:39
  • What makes your answer different is that you say that group = 1 assigns all values to the same group; the old answer didn't do that. With this piece of information, I think I now get it: after_stat(prop) takes the count for each level of x and now divides it by the total number of observations, because group = 1 puts all observations in one group and the divisor is the group total. In the first code, before we added group = 1, after_stat(prop) also took the count for each level of x, but because by default each level was its own group, it divided the count by that same number. – Zius Apr 18 '23 at 22:22
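
The per-group division described in these comments can be sketched by hand, without ggplot2. This is only an illustration of the idea (count rows per x within each group, then divide by that group's total), not the actual stat_count implementation; the helper prop_by_group() and the toy vector x are made up for this example.

```r
# Hypothetical helper: count rows per (x, group) and divide each count
# by the total count of the group it belongs to.
prop_by_group <- function(x, group) {
  counts <- as.data.frame(table(x = x, group = group), responseName = "count")
  counts <- counts[counts$count > 0, ]  # drop x/group combinations with no rows
  counts$prop <- counts$count / ave(counts$count, counts$group, FUN = sum)
  counts
}

# Toy data (made up): 1 "Fair", 2 "Good", 3 "Ideal".
x <- c("Fair", "Good", "Good", "Ideal", "Ideal", "Ideal")

# Group follows x: each level is its own group, so every prop is count / count.
default_groups <- prop_by_group(x, group = x)

# One constant group: every prop is count / total number of rows.
single_group <- prop_by_group(x, group = rep(1, length(x)))

print(default_groups)
print(single_group)
```

With the group following x, every prop is 1; with a single constant group, the props are 1/6, 2/6 and 3/6. The constant itself is arbitrary, which matches the remark that `group=1`, `group="apple"` and `group=99` behave the same.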
