I have a dataset of jokes, Dataset 2 (jester_dataset_2.zip) from the Jester project, and I would like to divide the jokes into groups with similar ratings and visualize the results appropriately.
The data look like this:
> str(tabulka)
'data.frame': 1761439 obs. of 3 variables:
$ User : int 1 1 1 1 1 1 1 1 1 1 ...
$ Joke : int 5 7 8 13 15 16 17 18 19 20 ...
$ Rating: num 0.219 -9.281 -9.281 -6.781 0.875 ...
Here is a subset of Dataset 2.
> head(tabulka)
User Joke Rating
1 1 5 0.219
2 1 7 -9.281
3 1 8 -9.281
4 1 13 -6.781
5 1 15 0.875
6 1 16 -9.656
I found out that I can't use ANOVA because the assumption of homogeneity of variance is violated. Hence I am using the Kruskal–Wallis test from the agricolae package in R.
library(agricolae)
KWtest <- with(tabulka, kruskal(Rating, Joke))
Here are the groups.
> head(KWtest$groups)
trt means M
1 53 1085099 a
2 105 1083264 a
3 89 1077435 ab
4 129 1072706 b
5 35 1070016 bc
6 32 1062102 c
The problem is that I don't know how to visualize the joke groups appropriately. Currently I am using a boxplot to show the distribution of ratings (median and quartiles) for each joke.
barvy <- c("yellow", "grey")
boxplot(Rating ~ Joke, data = tabulka,
        col = barvy,
        xlab = "Joke",
        ylab = "Rating",
        ylim = c(-7, 7))
It would be nice to color each box (each joke) according to the group letter assigned to it by the KW test, instead of just alternating two colors.
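To make the goal concrete, here is a minimal sketch of the coloring I have in mind. It uses small made-up data instead of the real `tabulka`, and a hand-written `groups` data frame that only mimics the layout of `KWtest$groups` shown above (columns `trt` and `M`). The palette and the decision to color by the first letter of `M` are my own assumptions, not something the KW test provides.

```r
# Toy stand-in for tabulka (made-up ratings, three jokes only)
set.seed(1)
toy <- data.frame(
  Joke   = rep(c(32, 53, 89), each = 50),
  Rating = c(rnorm(50, -1, 3), rnorm(50, 2, 3), rnorm(50, 1.5, 3))
)

# Hand-written stand-in for KWtest$groups (same columns: trt, M)
groups <- data.frame(
  trt = c("53", "89", "32"),
  M   = c("a", "ab", "c"),
  stringsAsFactors = FALSE
)

# boxplot() draws the boxes in the order of the factor levels of Joke,
# so look up each level's group letters in that same order.
# trimws() guards against padded labels in trt.
joke_levels <- levels(factor(toy$Joke))
letters_by_joke <- groups$M[match(joke_levels, trimws(as.character(groups$trt)))]

# Assumption: color by the first letter only, so "ab" gets the color of "a"
first_letter <- substr(letters_by_joke, 1, 1)
palette_by_letter <- c(a = "gold", b = "orange", c = "grey")
box_cols <- unname(palette_by_letter[first_letter])

boxplot(Rating ~ Joke, data = toy,
        col = box_cols,
        xlab = "Joke", ylab = "Rating")
legend("topright", legend = names(palette_by_letter),
       fill = palette_by_letter, title = "KW group")
```

With the real data the same lookup would presumably use `KWtest$groups` in place of `groups`, but overlapping labels such as "ab" or "bc" do not map cleanly to a single color, which is part of what I am unsure about.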
How could I do that? Or is there some better way to find the best and the worst jokes in the dataset?
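For comparison, the simplest alternative I can think of is to rank the jokes by their median rating. This is only a sketch on made-up numbers, and it ignores how many users rated each joke and whether the differences are significant:

```r
# Made-up ratings for four jokes (three ratings each)
toy <- data.frame(
  Joke   = rep(1:4, each = 3),
  Rating = c(-5, -3, -4,   0, 1, -1,   4, 5, 3,   1, 2, 0)
)

# Median rating per joke, sorted from best to worst
joke_medians <- aggregate(Rating ~ Joke, data = toy, FUN = median)
ranked <- joke_medians[order(joke_medians$Rating, decreasing = TRUE), ]

head(ranked, 1)  # joke with the highest median rating
tail(ranked, 1)  # joke with the lowest median rating
```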