
I am having trouble conducting Wilcoxon test analyses on heavily tied data. I have outlined my problem as best I can below, how I have tried to address it, and the questions I have. I'd be really grateful for any advice anyone could give me.

My Problem

I am working on a dataset where I need to compare three groups on a measure that was used for group assignment. When I run a one-way ANOVA, neither (1) the assumption of normality of residuals nor (2) the assumption of homogeneity of variance of the residuals is met.
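For context, the assumption checks can be done in R along these lines (a minimal sketch, using the same data frame and variable names as the Wilcoxon code below):

# Sketch: checking the one-way ANOVA assumptions
fit <- aov(measure ~ group, data = myreduceddataset)

# (1) normality of residuals
shapiro.test(residuals(fit))
qqnorm(residuals(fit)); qqline(residuals(fit))

# (2) homogeneity of variance across groups
bartlett.test(measure ~ group, data = myreduceddataset)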

I therefore used the Wilcoxon test to conduct pairwise comparisons in R with the following code (example for one comparison; two-sided alternative hypothesis as the default):

wilcox.test(measure ~ group, data = myreduceddataset, na.rm = TRUE, paired = FALSE, exact = TRUE, conf.int = TRUE)

However, the output of my analysis looked strange to me (screenshot of an example here), and R gave warnings for every comparison (one example copied below):

Warning messages:
1: In wilcox.test.default(x = c(2, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, :
  cannot compute exact p-value with ties
2: In wilcox.test.default(x = c(2, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, :
  cannot compute exact confidence intervals with ties
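To see the behaviour in isolation: stats::wilcox.test() can only compute an exact p-value when there are no ties, so with exact = TRUE and tied data it warns and falls back to the normal approximation, while exact = FALSE uses the approximation (with continuity correction) directly. A toy example with made-up numbers:

# Toy vectors with many tied zeros, mimicking the structure of Group 1
x <- c(2, 1, 0, 2, 0, 0, 0, 0, 0, 0)
y <- c(5, 3, 4, 6, 2, 7, 3, 5, 4, 6)
wilcox.test(x, y, exact = TRUE)   # warns, then uses the approximation anyway
wilcox.test(x, y, exact = FALSE)  # normal approximation with continuity correction, no warning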

Checking the data

I then checked the data and looked at how the data are ranked in R to try to figure out the warning. It seems as though, although there are some tied ranks throughout, the main problem is the number of 0 values in Group 1. Here is some example raw and ranked data by group.
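To illustrate what I mean about the ranking (a toy example with made-up values; base R's rank() gives tied values their average rank by default, so a block of zeros all share the same mid-rank):

vals <- c(0, 0, 0, 0, 1, 2, 2, 5)
rank(vals)     # 2.5 2.5 2.5 2.5 5.0 6.5 6.5 8.0 -- ties.method = "average" by default
table(vals)    # quick check of how heavy the ties are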

The solution I found, and questions this raised

From reading around, it appears that a solution could be to use the Wilcoxon test from the 'coin' package in R.

I had a go, and here is an example of my output. However, I am still not entirely sure whether this is correct, and I have outlined my remaining questions below (the call I used is sketched after that list).

  1. I am not sure if an asymptotic test or an exact test is more appropriate for this dataset (the output appears to be the same)
  2. I am assuming I should use the coin::wilcox_test() not the coin::wilcoxsign_test(), as I am comparing samples from independent groups. Is this correct?
  3. If I am understanding correctly, the 'Z' value is the effect size. How do I derive the W statistic? Or can I just report the effect size?
  4. I am not sure how to correct this output for multiple comparisons
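For reference, the coin call I tried looks roughly like this (a sketch with the same variable names as above, assuming the group labels are "1", "2", "3" as in the simulated data below; coin needs the grouping factor to have exactly two levels, so the unused level is dropped for each pairwise comparison):

library(coin)

# One pairwise comparison (group 1 vs group 2)
d12 <- droplevels(subset(myreduceddataset, group %in% c("1", "2")))

# Exact (permutation) distribution -- handles ties, unlike stats::wilcox.test()
wt12 <- wilcox_test(measure ~ group, data = d12, distribution = "exact")
wt12

# Asymptotic (normal approximation) version, for comparison
wilcox_test(measure ~ group, data = d12, distribution = "asymptotic")

# The Z shown in the output and its p-value can be extracted with
statistic(wt12, type = "standardized")
pvalue(wt12)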

I'd be more than happy to give more detail if it would be helpful. Many thanks in advance.

UPDATE: Simulated data (same group means and SDs) here:

structure(list(measure = c(9, 15, 6, 7, 8, 7, 12, 5, 14, 9, 7, 
13, 8, 14, 11, 16, 9, 7, 3, 8, 3, 21, 4, 3, 11, 13, 5, 7, 8, 
15, 5, 15, 3, 9, 5, 2, 8, 6, 1, 1, 7, 6, 9, 5, 6, 2, 6, 10, 6, 
6, 8, 6, 9, 8, 6, 2, 6, 2, 9, 5, 6, 4, 10, 7, 9, 8, 6, 4, 6, 
14, 1, 12, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 1, 1, 2, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0), group = structure(c(3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1", "2", "3"
), class = "factor")), row.names = c(NA, -122L), class = "data.frame")
  • *"Warning"* is not an error. Please do not post an image of code/data/errors: it cannot be copied or searched (SEO), it breaks screen-readers, and it may not fit well on some mobile devices. Ref: https://meta.stackoverflow.com/a/285557 (and https://xkcd.com/2116/). Please just include the code, console output, or data (e.g., `data.frame(...)` or the output from `dput(head(x))`) directly. – r2evans Feb 04 '21 at 13:34
  • Welcome to SO, Zcjth84! This question may not be a good fit for StackOverflow. (1) There is no code and no data; it seems more conceptual, in which case [stats.se] is a much better fit for the discussion. You might get commentary/answers *here* (some users traverse both sites), but that's no guarantee. (2) Even if it stays here on SO, then (again) while this site is about programming, there's very little here to work on. Please see some discussion on asking questions *well*: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. – r2evans Feb 04 '21 at 13:36
  • Thanks for your advice - I have now also added it to Cross Validated and will post questions there in future. – Zcjth84 Feb 05 '21 at 10:33

1 Answer


What you need is a Kruskal-Wallis test, the non-parametric counterpart of ANOVA.

Edit:

library(dplyr)
library(ggpubr)
# df is the simulated data frame from the question (the dput() output above)
# group as factor
df$group <- as.factor(df$group)
# check for levels
levels(df$group)
# summarise with dplyr
group_by(df, group) %>%
  summarise(
    count = n(),
    mean = mean(measure, na.rm = TRUE),
    sd = sd(measure, na.rm = TRUE),
    median = median(measure, na.rm = TRUE),
    IQR = IQR(measure, na.rm = TRUE)
  )
# Box Plot measure by group and color by group
library("ggpubr")
ggboxplot(df, x = "group", y = "measure", 
          color = "group", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
          order = c("1", "2", "3"),
          ylab = "measure", xlab = "group")

# Mean Plot measure by group and color by group
ggline(df, x = "group", y = "measure", 
       add = c("mean_se", "jitter"), 
       order = c("1", "2", "3"),
       ylab = "measure", xlab = "group")
# kruskal test
kruskal.test(measure ~ group, data = df)

## output   Kruskal-Wallis rank sum test

## data:  measure by group
## Kruskal-Wallis chi-squared = 92.593, df = 2, p-value < 2.2e-16

### interpretation: There is a significant difference between groups 1, 2 and 3 (at least one group differs from the others)



# pairwise comparisons between group levels
pairwise.wilcox.test(df$measure, df$group,
                     p.adjust.method = "bonferroni")

## output:  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 

#data:  df$measure and df$group 

#   1       2    
#   2 4.2e-16 -    
#   3 6.9e-16 0.013

# interpretation: The difference is significant between every pair of groups
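Regarding the effect-size question (question 3): a common effect size for the Wilcoxon/Mann-Whitney test is r = |Z| / sqrt(N). A sketch for one pairwise comparison using coin and the simulated df (group 1 vs group 2 is picked arbitrarily here):

library(coin)

# effect size r = |Z| / sqrt(N) for the group 1 vs group 2 comparison
d12 <- droplevels(subset(df, group %in% c("1", "2")))
wt  <- wilcox_test(measure ~ group, data = d12)
z   <- as.numeric(statistic(wt, type = "standardized"))
abs(z) / sqrt(nrow(d12))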

  • Thanks very much TarJae for your answer. I went for the Wilcoxon test because the distributions of the data in my groups are quite different (group 1 is positively skewed due to all the 0s, the others are fairly normal). Would you still recommend Kruskal-Wallis in this case? – Zcjth84 Feb 04 '21 at 13:38
  • Non-parametric means the distribution is not important: you are testing on ranks, so ties are also not a primary concern. Show me your data and I can give you an example. I think Kruskal-Wallis is what you need. – TarJae Feb 04 '21 at 13:41
  • Hi, again, thanks very much for your help. I have simulated some similar data and added it to my question. Let me know if another format would be more helpful. – Zcjth84 Feb 04 '21 at 15:13