Find and visualize best and worst items using boxplot

Question

I have a dataset of joke ratings, Dataset 2 (jester_dataset_2.zip) from the Jester project, and I would like to divide the jokes into groups with similar ratings and visualize the results appropriately.

The data look like this

> str(tabulka)
'data.frame':   1761439 obs. of  3 variables:
 $ User  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Joke  : int  5 7 8 13 15 16 17 18 19 20 ...
 $ Rating: num  0.219 -9.281 -9.281 -6.781 0.875 ...

Here is a subset of Dataset 2.

> head(tabulka)
  User Joke Rating
1    1    5  0.219
2    1    7 -9.281
3    1    8 -9.281
4    1   13 -6.781
5    1   15  0.875
6    1   16 -9.656

I found out I can't use ANOVA because the homogeneity-of-variance assumption does not hold. Hence I am using the Kruskal–Wallis test from the agricolae package in R.

KWtest <- with(tabulka, agricolae::kruskal(Rating, Joke))

Here are the groups.

> head(KWtest$groups)
  trt   means  M
1  53 1085099  a
2 105 1083264  a
3  89 1077435 ab
4 129 1072706  b
5  35 1070016 bc
6  32 1062102  c

The thing is I don't know how to visualize the joke groups appropriately. I am using a boxplot to show the distribution of ratings for each joke.

barvy <- c("yellow", "grey")
boxplot(Rating ~ Joke, data = tabulka,
        col = barvy,
        xlab = "Joke",
        ylab = "Rating",
        ylim = c(-7, 7))

It would be nice to somehow color each box (each joke) with an appropriate color according to the color given by the KW test.

How could I do that? Or is there some better way to find the best and the worst jokes in the dataset?

Could you post a subset of your data, making your question self-contained and reproducible? — AkselA, Feb 18 '19 at 10:43
@AkselA I have just included it. There are 140 jokes rated by 59132 users, about 1.7 million ratings in total. Each rating is a real value from -10 to 10. — Slazer, Feb 18 '19 at 10:49
@AkselA I am not sure what you are asking for. The dataset used is freely available, and the link is included. The `KWtest` object is the result of the `kruskal` function from the `agricolae` package run on the given dataset. — Slazer, Feb 18 '19 at 20:54
Q&As on SO are expected to be, as far as possible, MCVEs, not just now, but also in the future. That means no relying on links, because links, as we all know, break. — AkselA, Feb 18 '19 at 22:28
And `jester_dataset_2` contains two `dat` files. What are we supposed to do with those? — AkselA, Feb 19 '19 at 00:05

score 2 · Answer 1 · answered Feb 18 '19 at 11:51

Interesting question per se. It's easy to color each box according to the group the joke belongs to. However, I think this is just an intermediate solution; there must be a better visualization for these data. So, certainly not the best one, but here is my version:

library(tidyverse)

# download data (jokes, part 1) to a temporary file, and unzip
tmp <- tempfile()
download.file("http://eigentaste.berkeley.edu/dataset/jester_dataset_1_1.zip", tmp)
tmp <- unzip(tmp)

# read data from temp
vtipy <- readxl::read_excel(tmp, col_names = FALSE, na = '99')

# clean data
vtipy <- vtipy %>%
  mutate(user = 1:n()) %>%
  gather(key = 'joke', value = 'rating', -c('..1', 'user')) %>%
  rename(n = '..1') %>%
  filter(!is.na(rating)) %>%
  mutate(joke = as.character(as.numeric(gsub('\\.+', '', joke)) - 1)) %>%
  select(user, n, joke, rating)

# your code
KWtest <- with(vtipy, agricolae::kruskal(rating, joke))

# join groups from KWtest to original data, clean and plot
KWtest$groups %>%
  rownames_to_column('joke') %>%
  select(joke, groups) %>%
  right_join(vtipy, by = 'joke') %>% 
  mutate(joke = stringi::stri_pad_left(joke, 3, '0')) %>%
  ggplot(aes(x = joke, y = rating, fill = groups)) +
  geom_boxplot(show.legend = FALSE) +
  scale_x_discrete(breaks = stringi::stri_pad_left(c(1, seq(5, 100, by = 5)), 3, '0')) +
  ggthemes::theme_tufte() +
  labs(x = 'Joke', y = 'Rating')
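
If `KWtest$groups` has the columns `trt` and `M` shown in the question rather than a `groups` column, the same coloring idea can be sketched in base R. This is only a sketch under assumptions: `tabulka` below is simulated toy data, and `groups_tbl` is a hypothetical stand-in shaped like the `KWtest$groups` output in the question, so swap in the real objects.

```r
# Toy stand-in for the long-format ratings table (Joke, Rating).
set.seed(1)
tabulka <- data.frame(
  Joke   = rep(1:6, each = 50),
  Rating = rnorm(300, mean = rep(c(3, 3, 1, 0, -1, -3), each = 50), sd = 3)
)

# Hypothetical groups table with the column names from the question
# (trt, M); with a real fit, use KWtest$groups here instead.
groups_tbl <- data.frame(
  trt = c(1, 2, 3, 4, 5, 6),
  M   = c("a", "a", "ab", "b", "bc", "c"),
  stringsAsFactors = FALSE
)

# boxplot() draws the boxes in factor-level order, so look up each
# level's group label in that same order.
lev    <- levels(factor(tabulka$Joke))
labels <- groups_tbl$M[match(lev, trimws(as.character(groups_tbl$trt)))]

# One color per distinct group label.
palette_kw <- setNames(rainbow(length(unique(groups_tbl$M))),
                       unique(groups_tbl$M))
barvy <- unname(palette_kw[labels])

boxplot(Rating ~ Joke, data = tabulka, col = barvy,
        xlab = "Joke", ylab = "Rating")
legend("topright", legend = names(palette_kw), fill = palette_kw,
       title = "KW group", cex = 0.8)
```

Jokes that share a label such as `a` get the same fill, while overlapping labels such as `ab` get their own color. With 140 jokes the legend will be long, so it may be better to drop it or to color by the first letter only.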

utubun

Sorry, I forgot to mention I am using Dataset 2 ('jester_dataset_2.zip'). Is it difficult to modify the code to use the second dataset? When I run your code to plot the groups I get the error `Error: All select() inputs must resolve to integer column positions. The following do not: * groups`. – Slazer Feb 18 '19 at 20:44
I just found out this approach might be mathematically wrong. `It is tricky to know how to visually display the results of a Kruskal–Wallis test. It would be misleading to plot the means or medians on a bar graph, as the Kruskal–Wallis test is not a test of the difference in means or medians.` Source: http://www.biostathandbook.com/kruskalwallis.html – Slazer Feb 18 '19 at 20:44
So how do I reasonably find (and visualize) the best and worst jokes? – Slazer Feb 18 '19 at 20:48
It's not difficult. But if you think it is incorrect, it's better to find a correct type of visualization. I'd like to participate, that's a cool dataset. – utubun Feb 18 '19 at 20:48
Let's first try to answer the question of what the best and worst jokes in the dataset are. We can then show them in the boxplot in two different colors. The question is how to take into account, say, the number of ratings. One five-star rating is probably not better than a thousand four-star ratings. I wonder if Kruskal–Wallis is useful at all in this matter. – Slazer Feb 18 '19 at 21:05
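
One common way to take the number of ratings into account is a damped (shrunken) mean, which pulls each joke's mean toward the overall mean as if the joke had received `k` extra average ratings. The sketch below is an illustration on made-up data, not something proposed in this thread; the value of `k` and all object names are assumptions.

```r
set.seed(42)
# Toy long-format ratings: joke 1 has a single top rating,
# joke 2 has many good ratings, joke 3 has many poor ones.
ratings <- data.frame(
  Joke   = c(1, rep(2, 1000), rep(3, 200)),
  Rating = c(10, runif(1000, 5, 9), runif(200, -8, 2))
)

# Per-joke number of ratings and plain mean.
n_j    <- tapply(ratings$Rating, ratings$Joke, length)
mean_j <- tapply(ratings$Rating, ratings$Joke, mean)
stats  <- data.frame(Joke = names(n_j),
                     n    = as.vector(n_j),
                     mean = as.vector(mean_j))

# Damped mean: behave as if every joke had k extra ratings equal to
# the overall mean. k is an assumed tuning constant.
k       <- 50
overall <- mean(ratings$Rating)
stats$damped <- (stats$n * stats$mean + k * overall) / (stats$n + k)

# Rank jokes by the damped mean, best first.
stats[order(-stats$damped), ]
```

With the plain mean, the joke with a single rating of 10 comes out on top; with the damped mean, the joke with a thousand good ratings does. The best and worst jokes by this score could then be given two different colors in the boxplot.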
