I am comparing two groups of lengths (different individuals) with boxplots using the ggplot2 package in R. I want to compare the two distributions, but so far the only way I have found to use a Wilcoxon test is stat_compare_means from the "ggpubr" package. Is this the right way to compare the distributions? Can I compare the distributions and not the means specifically? As you can see, I am a newbie in the stats world. Thank you!
-
If you have statistical questions about choosing an appropriate test, you should ask your question at [stats.se]. Stack Overflow is for specific programming questions. It's easier to help you here if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Mar 14 '21 at 17:46
-
Use the correct test for your data type. The Wilcoxon test is for discrete (non-continuous) data, like ages in years. Use a t-test for continuous data. If you need more than this advice on which test to use, then this is the wrong place as @MrFlick noted. I posted an answer describing several ways to do a Wilcoxon test. – Ben Norris Mar 14 '21 at 17:51
-
Thanks to both of you for these comments. They clarify what and how I should post here. I have seen the answer regarding the Wilcoxon test. I think it should be fine to use the Wilcoxon test comparing the mean for my analysis, but the pairwise Wilcoxon test and how to use it was quite helpful, as was the reminder that the Wilcoxon test is for discrete data! – FKM Mar 15 '21 at 20:16
2 Answers
Base R has a built-in function to do a Wilcoxon test: `wilcox.test`. You can feed it two numeric vectors or a formula relating a numeric variable to a factor variable (with two levels).
# vector input
setosa_SL <- iris$Sepal.Length[which(iris$Species == "setosa")]
versicolor_SL <- iris$Sepal.Length[which(iris$Species == "versicolor")]
wilcox.test(setosa_SL, versicolor_SL)
Wilcoxon rank sum test with continuity correction
data: setosa_SL and versicolor_SL
W = 168.5, p-value = 8.346e-14
alternative hypothesis: true location shift is not equal to 0
# formula input
wilcox.test(Sepal.Length ~ Species, data = iris[which(iris$Species != "virginica"),])
Wilcoxon rank sum test with continuity correction
data: Sepal.Length by Species
W = 168.5, p-value = 8.346e-14
alternative hypothesis: true location shift is not equal to 0
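Both call forms return an object of class `htest`, so the statistic and p-value can be pulled out programmatically instead of being read off the printout. A minimal sketch on the same `iris` subsets (the vectors are redefined so the block runs on its own; the names `res_vec`, `res_frm` and `two_species` are only illustrative):

```r
# Sepal lengths for two species from the built-in iris data
setosa_SL     <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor_SL <- iris$Sepal.Length[iris$Species == "versicolor"]

# vector interface
res_vec <- wilcox.test(setosa_SL, versicolor_SL)

# formula interface on a two-species subset of the data frame
two_species <- droplevels(iris[iris$Species != "virginica", ])
res_frm <- wilcox.test(Sepal.Length ~ Species, data = two_species)

class(res_vec)      # the result is an "htest" object
res_vec$statistic   # the W statistic
res_vec$p.value     # the p-value as a plain number

# both interfaces run the same test, so the p-values agree
all.equal(res_vec$p.value, res_frm$p.value)
```

Storing the result this way is handy if you later want to put the p-value on a plot or in a table yourself.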
However, `iris$Species` has three levels. What if we wanted to do all three? The base `stats` package also has `pairwise.wilcox.test`.
pairwise.wilcox.test(iris$Sepal.Length, iris$Species)
Pairwise comparisons using Wilcoxon rank sum test with continuity correction
data: iris$Sepal.Length and iris$Species
setosa versicolor
versicolor 1.7e-13 -
virginica < 2e-16 5.9e-07
P value adjustment method: holm
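The object returned by `pairwise.wilcox.test` keeps the adjusted p-values as a matrix in `$p.value`, and the correction can be changed through the `p.adjust.method` argument. A short sketch on the same data; choosing `"bonferroni"` here is only for illustration, not a recommendation:

```r
# Holm correction is the default
pw_holm <- pairwise.wilcox.test(iris$Sepal.Length, iris$Species)

# lower-triangular matrix of adjusted p-values; unused cells are NA
pw_holm$p.value

# same comparisons with a Bonferroni correction instead
pw_bonf <- pairwise.wilcox.test(iris$Sepal.Length, iris$Species,
                                p.adjust.method = "bonferroni")
pw_bonf$p.value

# pull out a single comparison by row and column name
pw_bonf$p.value["virginica", "versicolor"]
```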
Now, I suspect you want to graph this. You need `pairwise_wilcox_test` and `add_xy_position` from the `rstatix` package and `stat_pvalue_manual` from the `ggpubr` package. The `pairwise_wilcox_test` function is an improvement over the base R `pairwise.wilcox.test` since it returns a tibble rather than a list of class `pairwise.htest`.
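library(rstatix)  # pairwise_wilcox_test(), add_xy_position()
library(ggpubr)   # stat_pvalue_manual(); loading ggpubr also attaches ggplot2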
iris %>% pairwise_wilcox_test(Sepal.Length ~ Species)
# A tibble: 3 x 9
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
* <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
1 Sepal.Length setosa versicolor 50 50 168. 8.35e-14 1.67e-13 ****
2 Sepal.Length setosa virginica 50 50 38.5 6.40e-17 1.92e-16 ****
3 Sepal.Length versicolor virginica 50 50 526 5.87e- 7 5.87e- 7 ****
The function `add_xy_position` adds x and y coordinate information to make this data more suitable for plotting, and `stat_pvalue_manual` adds a layer containing the p-value information.
ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  stat_pvalue_manual(iris %>%
                       pairwise_wilcox_test(Sepal.Length ~ Species) %>%
                       add_xy_position())

This info is preliminary:
If you want to test whether your data is normally distributed, use the Kolmogorov-Smirnov test.
If the data is normally distributed, use a t-test to compare the means of your two groups.
If the data is not normally distributed, use the Wilcoxon rank sum test (= Mann-Whitney U test) to compare the medians of the two groups.
`dput()` your data and I can show you the code.
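Until the real data is posted, the steps above could be sketched roughly as follows, using two `iris` species as stand-in data. The helper name `compare_two_groups` and the 0.05 cutoff are assumptions made for illustration. Note also that a Kolmogorov-Smirnov test against a normal distribution whose mean and standard deviation are estimated from the same sample is only approximate (its p-values tend to be too large), so treat the normality check as a rough guide.

```r
# Hypothetical helper: pick a t-test or a Wilcoxon rank sum test depending on
# a Kolmogorov-Smirnov normality check of each group (alpha is an assumption)
compare_two_groups <- function(x, y, alpha = 0.05) {
  # one-sample KS test against a normal with the sample's own mean and sd;
  # suppressWarnings() hides the warning ks.test gives when values are tied
  looks_normal <- function(v) {
    p <- suppressWarnings(ks.test(v, "pnorm", mean(v), sd(v)))$p.value
    p > alpha
  }
  if (looks_normal(x) && looks_normal(y)) {
    t.test(x, y)        # Welch two-sample t-test
  } else {
    wilcox.test(x, y)   # Wilcoxon rank sum (Mann-Whitney U) test
  }
}

# stand-in data: sepal lengths of two iris species
x <- iris$Sepal.Length[iris$Species == "setosa"]
y <- iris$Sepal.Length[iris$Species == "versicolor"]

res <- compare_two_groups(x, y)
res$method    # which test was chosen
res$p.value
```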
