I have a column member_casual that can take up to 3 values. I want to display a histogram in percent for each of those values so that I can compare them. It's important that the percentages be computed relative to the number of rows with member_casual == value, and not relative to the total number of rows. I was able to do so with the following code:
dataCustomer <- tripDataFiles %>%
  filter(member_casual == "Customer")
dataSubscriber <- tripDataFiles %>%
  filter(member_casual == "Subscriber")
dataDependent <- tripDataFiles %>%
  filter(member_casual == "Dependent")
ggplot(dataCustomer, aes(x = tripduration, y = stat(count / sum(count)))) +
  geom_histogram(aes(fill = 'customer'), alpha = 0.5) +
  geom_histogram(data = dataSubscriber, aes(fill = 'subscriber'), alpha = 0.5) +
  geom_histogram(data = dataDependent, aes(fill = 'dependent'), alpha = 0.5) +
  scale_y_continuous(labels = scales::percent)
That gave me the following graph:
But I'm not satisfied with this code, since I have to add a line for each value of member_casual. If the values of member_casual change, I will have to rework the code.
Do you know a way to achieve the same result with code that doesn't rely on the specific member_casual values?
Thanks
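One possible direction, sketched under assumptions rather than verified on the Divvy data: map fill to member_casual in a single geom_histogram() layer and use after_stat(density * width) as the height. stat_bin() computes density separately for each group, so density * width is each bin's share of its own group's rows. By contrast, count / sum(count) in a single grouped layer would, as far as I understand, divide by the total over all groups. The data frame `trips` below is hypothetical toy data standing in for tripDataFiles, and bins = 30 is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical toy data standing in for tripDataFiles (not the real Divvy data)
set.seed(1)
trips <- data.frame(
  member_casual = rep(c("Customer", "Subscriber", "Dependent"),
                      times = c(200, 700, 100)),
  tripduration  = c(rexp(200, 1 / 1500), rexp(700, 1 / 700), rexp(100, 1 / 1000))
)

p <- ggplot(trips, aes(x = tripduration, fill = member_casual)) +
  geom_histogram(
    aes(y = after_stat(density * width)),  # bin count / rows in that group
    bins = 30, position = "identity", alpha = 0.5
  ) +
  scale_y_continuous(labels = scales::percent)

# Sanity check: within each group the bar heights should add up to 1
built <- ggplot_build(p)$data[[1]]
group_totals <- tapply(built$y, built$group, sum)
print(group_totals)
```

Nothing in this sketch names a specific member_casual value, so it should not need changes if the set of values changes.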
EDIT:
The data comes from https://divvy-tripdata.s3.amazonaws.com/index.html
It covers the years 2015 to 2017, which I reformatted to match the 2023 format
tripDataFiles17 <- dataFileNames %>%
  grep(pattern = '2017', x = ., value = TRUE) %>% # Select the 2017 files
  grep(pattern = 'station', x = ., ignore.case = TRUE, invert = TRUE, value = TRUE) %>% # Drop the station files
  lapply(fread) %>% # Read each selected file
  rbindlist() %>% # Combine the selected files into one table
  rename(
    started_at = start_time,
    ended_at = end_time
  ) %>%
  mutate(
    started_at = parse_date_time(started_at, dateTimeFormat),
    ended_at = parse_date_time(ended_at, dateTimeFormat)
  ) # Convert datetime strings to datetimes
tripDataFiles <- rbindlist(list(tripDataFiles15_16, tripDataFiles17)) %>%
  rename(
    ride_id = trip_id,
    start_station_id = from_station_id,
    start_station_name = from_station_name,
    end_station_id = to_station_id,
    end_station_name = to_station_name,
    member_casual = usertype
  )
dput(tripDataFiles[1:20, c("member_casual", "tripduration")])
structure(list(member_casual = c("Subscriber", "Customer", "Subscriber",
"Customer", "Subscriber", "Subscriber", "Subscriber", "Subscriber",
"Subscriber", "Customer", "Customer", "Customer", "Customer",
"Subscriber", "Subscriber", "Subscriber", "Subscriber", "Subscriber",
"Subscriber", "Subscriber"), tripduration = c(299L, 940L, 751L,
1240L, 1292L, 175L, 930L, 383L, 260L, 1123L, 1167L, 231L, 1092L,
585L, 401L, 177L, 653L, 303L, 223L, 353L)), row.names = c(NA,
-20L), class = c("data.table", "data.frame"), ...)
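The per-group percentage described at the top can be checked by hand on these 20 rows. Here is a minimal base R sketch that rebuilds the sample as a plain data.frame (the elided attributes are left out) and divides each bin count by its own group's row count. The 500-second bins are an assumption made only for illustration.

```r
# The 20 sample rows from the dput() output, rebuilt as a plain data.frame
sample_trips <- data.frame(
  member_casual = c("Subscriber", "Customer", "Subscriber", "Customer",
                    "Subscriber", "Subscriber", "Subscriber", "Subscriber",
                    "Subscriber", "Customer", "Customer", "Customer", "Customer",
                    "Subscriber", "Subscriber", "Subscriber", "Subscriber",
                    "Subscriber", "Subscriber", "Subscriber"),
  tripduration = c(299L, 940L, 751L, 1240L, 1292L, 175L, 930L, 383L, 260L,
                   1123L, 1167L, 231L, 1092L, 585L, 401L, 177L, 653L, 303L,
                   223L, 353L)
)

# Arbitrary 500-second bins, chosen only for illustration
bins <- cut(sample_trips$tripduration, breaks = seq(0, 1500, by = 500))

# Rows of the table are member_casual values, columns are bins
counts <- table(sample_trips$member_casual, bins)

# Divide each row by its own group total, not by nrow(sample_trips)
shares <- prop.table(counts, margin = 1)
print(round(shares, 3))
```

Each row of `shares` sums to 1, which is the normalisation the histogram should reproduce for every value of member_casual.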