0

I have an Excel file and am trying to create a bar chart that groups categories and shows the average rating of the category. Because there are a lot of categories, I'd also like to only show either the top 10 or bottom 10 in the resulting horizontal bar chart.

 category rating
 A        10
 A        8
 A        9
 B        1
 B        4
 B        9
 C        6
 C        7
 D        9

Something like this (representative bar instead of the numbers):

A 9
D 9
...
C 6.5
B 4.66

I know this seems super simple to do, but I can't seem to be able to get anything working after trying various answers around here. Using ggplot2 seems to be the most promising so far. Closest I've gotten is showing the number of ratings for each category...

Edit: didn't save the work I did earlier as it wasn't the result I wanted, but it was something like this (didn't use ggplot)

dat[,c(1,12)]
category = dat[,1] //selecting column from sheet
rating = dat[,12] //selecting column from sheet
rating<-as.numeric(unlist(dat[,12]))
dat<-table(dat$rating,dat$category)
barplot(dat, main="Overall Ratings",
        xlab="Ratings", col=c("skyblue","red"), horiz=TRUE,
        legend = rownames(dat))
userrandomnums
  • 255
  • 1
  • 2
  • 13
  • First of all, show what you've got until now. You mentioned that you can show the number of ratings, show us how you did that. Also, this example is not reproducible. See this for reference: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Ricardo Fernandes Campos Mar 20 '18 at 03:38
  • @RicardoFernandesCampos I didn't save that work since it wasn't the result I wanted but I added in what it was like. – userrandomnums Mar 20 '18 at 03:46

2 Answers2

2

Here's a chaining solution using dplyr and tidyr. First, we need to load the data.

library(dplyr)
library(tidyr)
library(ggplot2)
df <- read.table(text="category,rating
 A,10
 A,8
 A,9
 B,1
 B,4
 B,9
 C,6
 C,7
 D,9
", sep=",", header=TRUE)

Now to the solution. After grouping the data by category, we calculate each category's mean rating.

means.df <- 
df %>%
    group_by(category) %>%
    summarise(mean = mean(rating))

top_n selects the top (positive number) or bottom (negative number) n rows from a dataset. We apply this to our dataset with means. In your real data, adjust the 2 to 10 for the top and to -10 for the bottom 10 categories.

means.df %>%
    top_n(2, mean) %>%
    ggplot(aes(x = category, y = mean)) +
    geom_bar(stat = 'identity')

The following code plots the top/bottom cutoff_number categories into one plot. Adjust the variable cutoff_number as needed.

cutoff_number <- 2
means.df %>%
    arrange(-mean) %>%
    mutate(
        topbottom = ifelse(row_number() <=  cutoff_number, "top", NA),
        topbottom = ifelse(row_number() > nrow(.) - cutoff_number, "bottom", topbottom)
    ) %>%
    ggplot(aes(x = category, y = mean)) +
    geom_bar(stat = 'identity') +
    facet_wrap(~topbottom, scales = 'free_x')
David
  • 301
  • 2
  • 5
  • Thanks! I just noticed in my data that some categories only have one rating, so I'd like to exclude categories with less than, say, 5 ratings. How do I filter those out? Searching tells me I can do something like dat %>% group_by("category") %>% filter(length(category)>=5) but it still included all for some reason. – userrandomnums Mar 20 '18 at 19:32
  • Got the filtering working. I'm also trying to flip the resulting graph horizontal using coord_flip(), but it also includes categories in the top/bottom graphs that are part of NA (so there's a bunch of blanks). How do I show only the ones that are in the top/bottom in the horizontal graph? (I removed NA completely using drop_na()) – userrandomnums Mar 20 '18 at 23:17
  • How do you filter? If you use something like this, there should not be NAs in the data: `means.df2 <- df %>% group_by(category) %>% summarise( mean = mean(rating), n = n() ) %>% filter(n > 1) %>% select(-n)`. Of course, if you do a `coord_flip()`, both graphs will share the y axis, and hence include all categories. – David Mar 21 '18 at 09:37
  • I added filter(length(airline_name)>=10) %>% after the group_by. Is there a way to flip without including all categories? – userrandomnums Mar 21 '18 at 16:00
  • You can use `scales = 'free_y'` as an argument to `facet_wrap()` in this case. – David Mar 21 '18 at 17:47
0

This solution uses data.table to summarize the data, then delivers the result to ggplot:

library(data.table);library(ggplot2)
category=c("A","A","A","B","B","B","C","C","D")
rating=c(10,9,8,1,4,9,6,7,9)
dt=as.data.table(cbind(category,rating))
ggplot(dt[,mean(as.numeric(rating)),by=category],aes(category,V1))+geom_col()+ylab("Mean")