I have a column member_casual that can take up to 3 values. I want to display a histogram in percent for each of those values so that I can compare them. It is important that, in the percentage calculation, the denominator be the number of rows with member_casual == value, not the total number of rows. I was able to do so with the following code:

dataCustomer <- tripDataFiles %>% 
  filter( member_casual == "Customer")
dataSubscriber <- tripDataFiles %>% 
  filter( member_casual == "Subscriber")
dataDependent <- tripDataFiles %>% 
  filter( member_casual == "Dependent")
ggplot(dataCustomer, aes(x = tripduration, y = after_stat(count / sum(count)))) +
  geom_histogram(aes(fill='customer'), alpha = 0.5)+
  geom_histogram(data=dataSubscriber, aes(fill='subscriber'), alpha = 0.5)+
  geom_histogram(data=dataDependent, aes(fill='dependent'), alpha = 0.5)+
  scale_y_continuous(labels = scales::percent)

That gave me the following graph (image not included). But I'm not satisfied with this code, since I have to add a line for each value of member_casual. If the values of member_casual change, I will have to rework the code.
Do you know a way to achieve the same result with code that doesn't rely on the member_casual values?
Thanks

EDIT:
The data comes from https://divvy-tripdata.s3.amazonaws.com/index.html
It covers the years 2015 to 2017, which I reformatted to the 2023 format.

tripDataFiles17 <- dataFileNames %>%
  grep(pattern = '2017', x = ., value = TRUE) %>% # Select the 2017 files (x = . keeps the piped value out of the next positional argument)
  grep(pattern = 'station',  x = ., ignore.case = TRUE, invert = TRUE, value = TRUE) %>% #Remove files on stations
  lapply(fread) %>% #Read data from the selected file
  rbindlist() %>%  #Merge data from selected file
  rename(
    started_at = start_time,
    ended_at = end_time
  ) %>% 
  mutate(started_at = parse_date_time(started_at,dateTimeFormat), ended_at = parse_date_time(ended_at,dateTimeFormat)) #Convert datetime string to datetime

tripDataFiles <- rbindlist( list(tripDataFiles15_16, tripDataFiles17)) %>%
  rename(
    ride_id = trip_id,
    start_station_id = from_station_id,
    start_station_name = from_station_name,
    end_station_id = to_station_id,
    end_station_name = to_station_name,
    member_casual = usertype
  )
dput(tripDataFiles[1:20, c("member_casual", "tripduration")])
structure(list(member_casual = c("Subscriber", "Customer", "Subscriber", 
"Customer", "Subscriber", "Subscriber", "Subscriber", "Subscriber", 
"Subscriber", "Customer", "Customer", "Customer", "Customer", 
"Subscriber", "Subscriber", "Subscriber", "Subscriber", "Subscriber", 
"Subscriber", "Subscriber"), tripduration = c(299L, 940L, 751L, 
1240L, 1292L, 175L, 930L, 383L, 260L, 1123L, 1167L, 231L, 1092L, 
585L, 401L, 177L, 653L, 303L, 223L, 353L)), row.names = c(NA, 
-20L), class = c("data.table", "data.frame"), ...)
jrcalabrese
Arthur. R
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input that can be used to test and verify possible solutions. – MrFlick Apr 04 '23 at 16:03
  • Your code looks odd. `ggplot` prefers a single data frame in long format, and it seems like you start with that and then, instead of going directly to the plot, you break it into separate data frames for each group. That's not necessary. If you share 10-20 rows of the relevant columns of `tripDataFiles`, we can help show you how. `dput(tripDataFiles[1:20, c("member_casual", "tripduration")])` will give us the first 20 rows... pick other rows if needed to make sure we get a couple of different values of `member_casual`. – Gregor Thomas Apr 04 '23 at 17:14
  • Frankly, we probably don't need _your_ data, you could just fabricate it with `rexp`, `rbeta`, etc. – r2evans Apr 04 '23 at 17:55
  • I separated the data frame based on member_casual values because it is the easiest way I found to (1) display the 3 histograms together so I can compare them (I couldn't get them all on the same graph with facets), and (2) have each histogram's y-axis computed from the size of its member_casual subset rather than the size of the whole data frame. – Arthur. R Apr 04 '23 at 19:43

1 Answer

I recreated your data to include the Dependent level of member_casual. You can keep your data in long format, pass it to ggplot() as a single data frame, and color each level of member_casual by mapping it to fill.

library(tidyverse)
set.seed(123)
member_casual <- sample(x = c("Subscriber", "Customer", "Dependent"), size = 1000, replace = TRUE)
tripduration <- sample(x = 200:3000, size = 1000, replace = TRUE)
tripDataFiles <- data.frame(member_casual, tripduration)
ggplot(tripDataFiles, aes(x = tripduration, 
                          y = after_stat(count / sum(count)),
                          fill = member_casual)) +
  geom_histogram(alpha = 0.5) +
  scale_y_continuous(labels = scales::percent)

(image of the resulting plot not included)
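For overlapping histograms where each level is normalized by its own row count, a possible variation is sketched below. It assumes ggplot2 3.3.0 or later (for after_stat()) and relies on the computed variables density and width of stat_bin: density is normalized within each group, so density * width is that group's share of rows per bin. position = "identity" draws the groups over one another instead of stacking them. The data is fabricated; the name trips, the unequal group probabilities and bins = 30 are illustrative assumptions, not part of the original data.

```r
library(ggplot2)

# Fabricated data with deliberately unequal group sizes (assumption, not the Divvy data)
set.seed(123)
trips <- data.frame(
  member_casual = sample(c("Subscriber", "Customer", "Dependent"),
                         size = 1000, replace = TRUE,
                         prob = c(0.70, 0.25, 0.05)),
  tripduration = sample(200:3000, size = 1000, replace = TRUE)
)

p <- ggplot(trips, aes(x = tripduration, fill = member_casual)) +
  geom_histogram(
    # density is computed per group, so density * width is the
    # within-group proportion of rows falling in each bin
    aes(y = after_stat(density * width)),
    bins = 30,
    position = "identity",  # overlay the groups instead of stacking them
    alpha = 0.5
  ) +
  scale_y_continuous(labels = scales::percent)

p
```

Nothing in this sketch names a specific value of member_casual, so a new level in the column would simply appear as one more fill color. Summing the bar heights per group in layer_data(p) should give 1 for every group, which is a quick way to confirm which denominator is being used.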

jrcalabrese
  • I want a display where the histograms for all member_casual values overlap and where the y-axis is in % of each value's own population. Here is what the graph looks like when I plot your (super nice) data sample with my display code: ![graph](https://imgbox.com/nwLcNni8) – Arthur. R Apr 11 '23 at 09:12
  • I'm not sure I understand; does the above answer not present "all the graphs" overlapping "for each member_casual"? Also, when you say "where the y-axis is in % for own population", do you mean that you want a different y-axis for each level of member_casual? – jrcalabrese Apr 23 '23 at 13:42
  • Yes, all the y-axes are different. I want to create one subset per member_casual value, build a histogram for each subset with the y-axis expressed in %, the denominator being the subset size, and then display all those histograms on the same graph so that they lie on top of one another. This way we can compare the distributions of the different subsets. If we had a single y-axis expressed in % with the total set size as the denominator, each subset's histogram would be scaled by its size, which would get in the way of the comparison. – Arthur. R Apr 24 '23 at 13:18
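The difference between the two denominators discussed in these comments can be checked numerically, independent of any plotting. The following is a minimal base R sketch on fabricated data; the trips data frame, the bin width of 200 and the group probabilities are assumptions made for illustration.

```r
# Fabricated data with deliberately unequal group sizes (assumption)
set.seed(123)
trips <- data.frame(
  member_casual = sample(c("Subscriber", "Customer", "Dependent"),
                         size = 1000, replace = TRUE,
                         prob = c(0.70, 0.25, 0.05)),
  tripduration = sample(200:3000, size = 1000, replace = TRUE)
)

# Shared bin edges, so bars from different groups cover the same intervals
breaks <- seq(200, 3000, by = 200)
bins <- cut(trips$tripduration, breaks = breaks, include.lowest = TRUE)

# Rows are bins, columns are member_casual levels
counts <- table(bin = bins, member_casual = trips$member_casual)

# Denominator = subset size: every column sums to 1
within_group <- prop.table(counts, margin = 2)

# Denominator = total row count: the whole table sums to 1,
# so each column sums to that group's share of the data
of_total <- prop.table(counts)

colSums(within_group)
colSums(of_total)
```

With within_group, a small group and a large group with the same shape produce bars of the same height, which is the comparison described above. With of_total, the small group's bars are shrunk by its share of the rows.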