Difference in n() count and geom_col graph likely resulted from group_by(), but why and how?

Question

Sorry in advance, as I'm an R newbie. I was working on Divvy Bike Share data (for details, see here). Here is a subset of my df:

head(df)

I wanted to visualize the total ridership count (how many times bikes are used), aggregated by day of the week. I tried two blocks of code whose only difference is in group_by(): the second one also includes "month". I don't understand what caused the difference in y-axis values between the two graphs.

library(dplyr)
library(ggplot2)
library(scales)  # provides label_number()

p1 <- df %>% 
  group_by(member_casual, day_of_week) %>%
  summarize(total_rides = n()) %>% 
  ggplot(aes(x = day_of_week, y = total_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Total Rides by Days in a Week", subtitle = "Casual Customers vs. Members", y = "ride times count (in thousands)") +
  theme(axis.title.x = element_blank()) +
  scale_fill_discrete(name = "") +
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "k"))
p1

p2 <- df %>% 
  group_by(member_casual, day_of_week, month) %>%
  summarize(total_rides = n()) %>% 
  ggplot(aes(x = day_of_week, y = total_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Total Rides by Days in a Week", subtitle = "Casual Customers vs. Members", y = "ride times count (in thousands)") +
  theme(axis.title.x = element_blank()) +
  scale_fill_discrete(name = "") +
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "k"))
p2

To check the tables generated before plotting, I ran the following blocks:

df %>% 
  group_by(member_casual, day_of_week) %>%
  summarize(total_rides = n())

tibble 1

df %>% 
  group_by(member_casual, day_of_week, month) %>%
  summarize(total_rides = n())

tibble 2

I think I understand that adding more variables to group_by() makes the resulting table more finely categorized, or "grouped". However, the total should always be the same, no? For example, if you add up all the casual & Sunday rows (split across 12 months) in tibble 2, you get exactly the number in tibble 1: 392107, the same number as shown in p1, not p2. This only added to my confusion.
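The claim that the finer table still adds up to the coarser one can be checked directly. Since df is not reproduced here, the following is only a sketch that uses ggplot2's built-in mpg data as a stand-in, with drv and year playing the role of member_casual and day_of_week, and cyl playing the role of month (that mapping is an assumption made for illustration):

```r
library(dplyr)
library(ggplot2)  # only needed here for the built-in mpg dataset

# Coarse grouping: one row per (drv, year), analogous to tibble 1
coarse <- mpg %>%
  group_by(drv, year) %>%
  summarize(total = n(), .groups = "drop")

# Finer grouping: one row per (drv, year, cyl), analogous to tibble 2
fine <- mpg %>%
  group_by(drv, year, cyl) %>%
  summarize(total = n(), .groups = "drop")

# Collapsing the fine table back over cyl should reproduce the coarse totals
recombined <- fine %>%
  group_by(drv, year) %>%
  summarize(total = sum(total), .groups = "drop")

all.equal(coarse, recombined)
```

If the two tables match, the counts themselves are consistent, which suggests the discrepancy arises at the plotting step rather than in the summary.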

In short, I have two questions:

1. Why do p1 and p2 differ? How can I avoid such errors in the future?
2. Where do the numbers in p2 come from?

Any advice would be greatly appreciated. Thank you!

Welcome to SO, Abooboo! The issue is that in your second plot or dataset you have multiple values per day, i.e. one for each month, and you are expecting these multiple values to be stacked so that the bars show the total per day. That would be the behavior if you removed `position = "dodge"`, in which case you would get a stacked bar chart. With `position = "dodge"`, however, the bars within a group are no longer stacked. Instead they are drawn over one another, i.e. the bar visible in p2 shows the month with the max. value. – stefan, Dec 17 '22 at 12:27
And as a reminder for future questions: Please do not post an image of code/data/errors [for these reasons](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-errors-when-asking-a-question/285557#285557). Just include the code, console output, or data directly. For more on that have a look at [How to create a great minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – stefan, Dec 17 '22 at 12:29
@stefan Thanks a lot stefan for your warm welcome and help! It's my first post in SO and I'll try to adapt to the norms asap! – Abooboo, Dec 18 '22 at 12:32

score 0 · Accepted Answer · answered Dec 17 '22 at 12:33

You’re assuming that the counts for each month will be stacked, so that together the column will show the total across all months. But in fact the counts are overplotted in front of one another, so only the highest month-count is visible. You can see this is the case if you add a border and make your columns transparent. Using mpg as an example, with cyl as the “extra” grouping variable:

library(dplyr)
library(ggplot2)

mpg %>%
  count(drv, year, cyl) %>%
  ggplot(aes(year, n, fill = drv)) +
  geom_col(
    position = "dodge",
    color = "black", 
    alpha = .5
  )
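To connect this back to the numbers, here is a rough sketch (again on mpg, so the column names are stand-ins for the Divvy data) that compares the largest count per bar position with the total, and then collapses the data to one row per bar before plotting. The name p_fixed is hypothetical and chosen only for this example:

```r
library(dplyr)
library(ggplot2)

counts <- count(mpg, drv, year, cyl)  # several rows per (drv, year)

# For each bar position: the largest single count (the height left visible
# when the bars overlap) next to the total that was actually wanted
visible <- counts %>%
  group_by(drv, year) %>%
  summarize(tallest = max(n), total = sum(n), .groups = "drop")
visible

# One row per (drv, year) before plotting, so each dodged bar is drawn once
p_fixed <- counts %>%
  group_by(drv, year) %>%
  summarize(n = sum(n), .groups = "drop") %>%
  ggplot(aes(year, n, fill = drv)) +
  geom_col(position = "dodge")
# print(p_fixed) draws the plot
```

Applied to the question, this amounts to grouping only by the variables that are mapped to an aesthetic (as in p1), or summing over month before calling ggplot().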

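NB: count(x) is roughly a shortcut for group_by(x) %>% summarize(n = n()); with several grouping variables, summarize() leaves the result grouped by all but the last one, whereas count() on an ungrouped data frame returns an ungrouped result.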
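A quick way to convince yourself of the equivalence, sketched on mpg with two grouping variables (the object names via_count and via_summarize are made up for this example):

```r
library(dplyr)
library(ggplot2)  # only needed here for the built-in mpg dataset

via_count <- mpg %>%
  count(drv, year)

via_summarize <- mpg %>%
  group_by(drv, year) %>%
  # .groups = "drop" removes the leftover grouping so the result
  # is ungrouped, like the output of count()
  summarize(n = n(), .groups = "drop")

all.equal(via_count, via_summarize)
```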

answered Dec 17 '22 at 12:33

zephryl


Thank you zephryl! Your illustration is very eloquent! So..if count(x) = group_by(x) %>% summarize(n = n()), does that mean I can quit using the latter one in every situation? – Abooboo Dec 18 '22 at 12:35
@Abooboo you’re welcome, and yes it does! You can also use it with multiple variables (`count(x, y, z)` or `count(across(x:z))`) and control the names of the output columns (`count(new_x = x, name = "total_rides")`). See the [count() reference](https://dplyr.tidyverse.org/reference/count.html). – zephryl Dec 18 '22 at 13:53
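The variants mentioned in this comment could look as follows on mpg. This is only an illustrative sketch: the output names drive and total_cars, and the object names by_three, by_across and renamed, are hypothetical.

```r
library(dplyr)
library(ggplot2)  # only needed here for the built-in mpg dataset

# Several grouping variables at once
by_three <- mpg %>% count(drv, year, cyl)

# The same grouping, with the columns selected through across()
by_across <- mpg %>% count(across(c(drv, year, cyl)))

# Rename a grouping column and the count column in one call
renamed <- mpg %>% count(drive = drv, name = "total_cars")
names(renamed)
```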
