3

I am comparing two groups of lengths (different individuals) with boxplots using the ggplot2 package in R. I want to compare the two distributions, but so far the only way I have found to run a Wilcoxon test is stat_compare_means from the "ggpubr" package. Is this the right way to compare the distributions? Can I compare the distributions as a whole and not the means specifically? As you can see, I am a newbie in the stats world. Thank you!

FKM
  • If you have statistical questions about choosing an appropriate test, you should ask your question at [stats.se]. Stack Overflow is for specific programming questions. Here it's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Mar 14 '21 at 17:46
  • Use the correct test for your data type. The Wilcoxon test is for discrete (non-continuous) data, like ages in years. Use a t-test for continuous data. If you need more than this advice on which test to use, then this is the wrong place as @MrFlick noted. I posted an answer describing several ways to do a Wilcoxon test. – Ben Norris Mar 14 '21 at 17:51
  • Thanks to both of you for these comments. They clarify what and how I should post here. I have seen the answer regarding the Wilcoxon test. I think it should be fine using the Wilcoxon test comparing the mean for my analysis, but the pairwise Wilcoxon test and how to use it was quite helpful, as well as reminding me that the Wilcoxon test is for discrete data! – FKM Mar 15 '21 at 20:16

2 Answers

6

Base R has a built-in function to do a Wilcoxon test: wilcox.test. You can feed it two numeric vectors or a formula relating a numeric variable to a factor variable (with two levels).

# vector input
setosa_SL <- iris$Sepal.Length[which(iris$Species == "setosa")]
versicolor_SL <- iris$Sepal.Length[which(iris$Species == "versicolor")]
wilcox.test(setosa_SL, versicolor_SL)

    Wilcoxon rank sum test with continuity correction

data:  setosa_SL and versicolor_SL
W = 168.5, p-value = 8.346e-14
alternative hypothesis: true location shift is not equal to 0 

# formula input
wilcox.test(Sepal.Length ~ Species, data = iris[which(iris$Species != "virginica"),])

    Wilcoxon rank sum test with continuity correction

data:  Sepal.Length by Species
W = 168.5, p-value = 8.346e-14
alternative hypothesis: true location shift is not equal to 0
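If you need the numbers rather than the printed summary, the object returned by wilcox.test can be stored and queried. The following is a minimal sketch using the same iris vectors as above; the variable name res is only illustrative, and the arguments exact = FALSE and conf.int = TRUE are my additions, not part of the calls shown earlier.

```r
# Sketch: keep the "htest" object returned by wilcox.test and extract parts of it.
setosa_SL <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor_SL <- iris$Sepal.Length[iris$Species == "versicolor"]

# exact = FALSE asks for the normal approximation explicitly (the data contain ties);
# conf.int = TRUE additionally returns an estimate of the location shift.
res <- wilcox.test(setosa_SL, versicolor_SL, exact = FALSE, conf.int = TRUE)

res$statistic  # the W statistic
res$p.value    # the p-value as a plain number
res$estimate   # estimated difference in location between the two groups
res$conf.int   # confidence interval for that difference
```

Having the p-value as a number is handy if you later want to place it on a plot yourself.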

However, iris$Species has three levels. What if we wanted to do all three?

The base stats package also has pairwise.wilcox.test.

pairwise.wilcox.test(iris$Sepal.Length, iris$Species)

    Pairwise comparisons using Wilcoxon rank sum test with continuity correction 

data:  iris$Sepal.Length and iris$Species 

           setosa  versicolor
versicolor 1.7e-13 -         
virginica  < 2e-16 5.9e-07  

P value adjustment method: holm 
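The adjustment method can be changed, and the adjusted p-values can be pulled out of the result as a matrix. A small sketch, assuming you prefer a Bonferroni correction over the default Holm method (that choice is only an example, and pw is an illustrative name):

```r
# Sketch: pairwise Wilcoxon tests with a different multiple-testing correction.
pw <- pairwise.wilcox.test(iris$Sepal.Length, iris$Species,
                           p.adjust.method = "bonferroni")

# The adjusted p-values are stored as a lower-triangular matrix;
# rows and columns are named after the factor levels.
pw$p.value
pw$p.value["virginica", "versicolor"]
```

Any method accepted by p.adjust (see p.adjust.methods) can be passed to p.adjust.method.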

Now, I suspect you want to graph this. You need pairwise_wilcox_test and add_xy_position from the rstatix package and stat_pvalue_manual from the ggpubr package. The pairwise_wilcox_test function is an improvement over the base R pairwise.wilcox.test since it returns a tibble rather than a list of class pairwise.htest.

library(rstatix)
library(ggpubr)

iris %>% pairwise_wilcox_test(Sepal.Length ~ Species)

# A tibble: 3 x 9
  .y.          group1     group2        n1    n2 statistic        p    p.adj p.adj.signif
* <chr>        <chr>      <chr>      <int> <int>     <dbl>    <dbl>    <dbl> <chr>       
1 Sepal.Length setosa     versicolor    50    50     168.  8.35e-14 1.67e-13 ****        
2 Sepal.Length setosa     virginica     50    50      38.5 6.40e-17 1.92e-16 ****        
3 Sepal.Length versicolor virginica     50    50     526   5.87e- 7 5.87e- 7 ****    

The function add_xy_position adds x and y coordinate information to make this data more suitable for plotting, and stat_pvalue_manual adds a layer containing the p-value information.

ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  stat_pvalue_manual(iris %>% 
                       pairwise_wilcox_test(Sepal.Length ~ Species) %>% 
                       add_xy_position())

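For just two groups, the stat_compare_means approach mentioned in the question also works. Below is a minimal sketch, assuming ggplot2 and ggpubr are installed; the two-species subset of iris stands in for the asker's data, and the names two_species and p are only illustrative.

```r
library(ggplot2)
library(ggpubr)

# Stand-in for a two-group data set: drop one species and its unused factor level.
two_species <- droplevels(subset(iris, Species != "virginica"))

# stat_compare_means runs the requested test and prints the p-value on the plot.
p <- ggplot(two_species, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  stat_compare_means(method = "wilcox.test")
p
```

The p-value shown on the plot should correspond to the two-group wilcox.test call earlier in this answer.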

Ben Norris
0

This info is preliminary:

If you want to test whether your data is normally distributed or not use Kolmogorov-Smirnov test.

If the data is normally distributed use t-test to compare the means of your two groups.

If the data is not normally distributed, then use the Wilcoxon rank sum test (= Mann-Whitney U test) to compare the medians of the two groups. Share your data with dput() and I can show you the code.
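Until then, the decision steps above can be sketched on the built-in iris data. This is only an illustration under assumptions: is_normal is a hypothetical helper, the 0.05 cutoff is arbitrary, and a Kolmogorov-Smirnov test against a normal whose mean and sd are estimated from the same sample gives only an approximate p-value.

```r
# Two example groups standing in for the asker's length data.
x <- iris$Sepal.Length[iris$Species == "setosa"]
y <- iris$Sepal.Length[iris$Species == "versicolor"]

# Hypothetical helper: one-sample KS test against a normal distribution with
# the sample's own mean and sd. Tie warnings are suppressed for readability.
is_normal <- function(v, alpha = 0.05) {
  p <- suppressWarnings(ks.test(v, "pnorm", mean = mean(v), sd = sd(v)))$p.value
  p > alpha
}

# Pick the test according to the normality check.
if (is_normal(x) && is_normal(y)) {
  result <- t.test(x, y)        # compares the means
} else {
  result <- wilcox.test(x, y)   # rank-based comparison
}
result
```

Replace x and y with your own two vectors of lengths to apply the same steps to your data.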

TarJae