Apologies in advance, I'm an R newbie. I am working on Divvy Bike Share data (details here). Here is a subset of my df:

head(df)

I wanted to visualize the total ridership count (how many times bikes are used), aggregated by day of the week. I tried two blocks of code whose only difference is the grouping: the second one also has "month" inside group_by(). I don't understand what causes the difference in y-axis values between the two graphs.

p1 <- df %>% 
  group_by(member_casual, day_of_week) %>%
  summarize(total_rides = n()) %>% 
  ggplot(aes(x = day_of_week, y = total_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Total Rides by Days in a Week", subtitle = "Casual Customers vs. Members", y = "ride times count (in thousands)") +
  theme(axis.title.x = element_blank()) +
  scale_fill_discrete(name = "") +
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "k"))
p1

p2 <- df %>% 
  group_by(member_casual, day_of_week, month) %>%
  summarize(total_rides = n()) %>% 
  ggplot(aes(x = day_of_week, y = total_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Total Rides by Days in a Week", subtitle = "Casual Customers vs. Members", y = "ride times count (in thousands)") +
  theme(axis.title.x = element_blank()) +
  scale_fill_discrete(name = "") +
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "k"))
p2

To check, I inspected the tables generated before plotting, using the following blocks:

df %>% 
  group_by(member_casual, day_of_week) %>%
  summarize(total_rides = n())

tibble 1

df %>% 
  group_by(member_casual, day_of_week, month) %>%
  summarize(total_rides = n())

tibble 2

I think I understand that adding more variables to group_by makes the resulting table more categorized, or "grouped". However, the total should always be the same, no? For example, if you add up all the casual Sunday rows in tibble 2 (split across 12 months), you get exactly the number in tibble 1: 392107, which is the number shown in p1, not p2. This only added to my confusion.

In short, I have two questions:

  1. Why do p1 and p2 differ? How can I avoid such errors in the future?
  2. Where do the numbers in p2 come from?

Any advice would be greatly appreciated. Thank you!

r2evans
Abooboo
  • Welcome to SO, Abooboo! The issue is that in your second dataset you have multiple values per day, i.e. one for each month, and you are expecting these values to be stacked so that each bar shows the total per day. That would be the behavior if you removed `position="dodge"`, in which case you would get a stacked bar. With `position="dodge"`, however, the bars within each group are no longer stacked. Instead they are plotted on top of each other, so the value displayed in p2 is that of the month with the maximum count. – stefan Dec 17 '22 at 12:27
  • And as a reminder for future questions: Please do not post an image of code/data/errors [for these reasons](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-errors-when-asking-a-question/285557#285557). Just include the code, console output, or data directly. For more on that, have a look at [How to create a great minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – stefan Dec 17 '22 at 12:29
  • @stefan Thanks a lot for your warm welcome and help! It's my first post on SO and I'll try to adapt to the norms asap! – Abooboo Dec 18 '22 at 12:32

1 Answer

You’re assuming that the counts for each month will be stacked, so that together the column will show the total across all months. But in fact the counts are overplotted in front of one another, so only the highest month-count is visible. You can see this is the case if you add a border and make your columns transparent. Using mpg as an example, with cyl as the “extra” grouping variable:

library(dplyr)
library(ggplot2)

mpg %>%
  count(drv, year, cyl) %>%
  ggplot(aes(year, n, fill = drv)) +
  geom_col(
    position = "dodge",
    color = "black", 
    alpha = .5
  )

NB: count(x) is a shortcut for group_by(x) %>% summarize(n = n()).
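As for avoiding this in the future, one option is to make sure the data passed to geom_col() has exactly one row per bar. The following is a minimal sketch using the same mpg example, not the Divvy data; the object names (`by_cyl`, `visible`, `totals`, `p_fixed`) are made up for illustration, and `factor(year)` is my own choice to get a discrete x-axis. It checks that the visible bar height is the per-group maximum, and that dropping the extra variable from the count gives the intended totals:

```r
library(dplyr)
library(ggplot2)

# Counts with the "extra" grouping variable (cyl), as in the example above
by_cyl <- mpg %>%
  count(drv, year, cyl)

# Several rows share the same drv/year, so their bars overlap.
# The height you can read off the plot is the tallest one (max),
# while the height you wanted is the sum.
visible <- by_cyl %>%
  group_by(drv, year) %>%
  summarize(tallest = max(n), total = sum(n), .groups = "drop")

# The intended totals: count without the extra variable
totals <- mpg %>%
  count(drv, year, name = "total")

# One row per bar, so nothing is hidden behind anything else
p_fixed <- ggplot(totals, aes(factor(year), total, fill = drv)) +
  geom_col(position = "dodge")
p_fixed
```

Applied to the question, this would mean grouping only by member_casual and day_of_week (as in p1) whenever the plot has one bar per member type and day, or summing total_rides over month before plotting.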

zephryl
  • Thank you zephryl! Your illustration is very clear! So, if count(x) = group_by(x) %>% summarize(n = n()), does that mean I can stop using the latter in every situation? – Abooboo Dec 18 '22 at 12:35
  • @Abooboo you’re welcome, and yes it does! You can also use it with multiple variables (`count(x, y, z)` or `count(across(x:z))`) and control the names of the output columns (`count(new_x = x, name = "total_rides")`). See [the `count()` documentation](https://dplyr.tidyverse.org/reference/count.html). – zephryl Dec 18 '22 at 13:53
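
The equivalence discussed in these comments can be checked directly. This is a small sketch on the mpg data from the answer, with `long_form` and `short_form` as illustrative names. One difference worth knowing: by default, summarize() leaves the result grouped by all but the last grouping variable, whereas count() on ungrouped input returns an ungrouped tibble, so the long form below passes `.groups = "drop"` to match.

```r
library(dplyr)

mpg <- ggplot2::mpg  # example data used in the answer above

# Long form: group, count rows per group, then drop the grouping
long_form <- mpg %>%
  group_by(drv, year) %>%
  summarize(n = n(), .groups = "drop")

# Short form: count() does the grouping and tallying in one step
short_form <- mpg %>%
  count(drv, year)

# Compare the two results
all.equal(long_form, short_form)

# name = sets the name of the count column
renamed <- mpg %>%
  count(drv, year, name = "total_rides")
names(renamed)
```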