0

I perform several wilcoxon-tests for subcategories of my dataset. R performs these tests but displays one big output for each test. I would prefer to have one output for instance in form of a table summarizing all 14 wilcoxon-tests in a tidy manner (name of analized subset, value of test statistic, p-value, outcome e.g. alternative hyothesis:...)

I already tried many tips that I found online but since I am not very much familiar with R I can not analyze the problems, it simply did not work and a friend told me: "stackoverflow is your friend. Ask for help!". Can you help me further?

Best, Roman

This is the code I perform to get my output:

strFlaecheNames<-c(df_summary$Flaeche)
varResult<-array(vector("list",10000),1000)

for(i in 1:14){ 

varResult[i]<-wilcox.test(df1$y,data=df1,subset(df1$y, df1$x ==  strFlaecheNames[i]))

print((wilcox.test(df1$y,data=df1,subset(df1$y, df1$x == strFlaecheNames[i]))))

}

One of my 14 outputs looks like this:

Wilcoxon rank sum test with continuity correction

data:  df1$y and subset(df1$y, df1$x == strFlaecheNames[i])
W = 1170300, p-value = 4.888e-13
alternative hypothesis: true location shift is not equal to 0

Here is a code sample, I also have it in form of reprex but I kind of fail to post it but since the code works I guess it is ok to post it?:

    ed_exp2 <- structure(list(x = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
                                              1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 7L, 7L, 7L, 7L, 7L, 7L,
                                              7L, 7L, 7L, 7L, 7L, 7L, 7L), .Label = c("Area1", "Area10", "Area11",
                                                                                      "Area12", "Area13", "Area14", "Area2", "Area3", "Area4", "Area5",
                                                                                      "Area6", "Area7", "Area8", "Area9"), class = "factor"), y = c(0L,
                                                                                                                                                    0L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 2L, 1L,
                                                                                                                                                    2L, 0L, 1L, 0L, -2L, 2L, 0L, 2L, 1L, 2L, 2L, -2L, 0L, 0L)), .Names = c("x",
                                                                                                                                                                                                                           "y"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L,
                                                                                                                                                                                                                                               11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 169L, 170L, 171L, 172L,
                                                                                                                                                                                                                                               173L, 174L, 175L, 176L, 177L, 178L, 179L, 180L, 181L), class = "data.frame")
#load libraries
library("stats")
library("dplyr")
library("ggpubr")
library("tidyverse")
library("reprex")

strAreaNames<-c("Area1","Area2")

##required size of memory for output unclear - therefore "10000),1000)"
varResult<-array(vector("list",10000),1000)
#run wilcox.test
for(i in 1:2){
  varResult[i]<-wilcox.test(ed_exp2$y,data=ed_exp2,subset(ed_exp2$y, ed_exp2$x == strAreaNames[i]))
  print((wilcox.test(ed_exp2$y,data=ed_exp2,subset(ed_exp2$y, ed_exp2$x == strAreaNames[i]))))
}
Roman
  • 1
  • 3
  • 2
    Welcome to Stack Overflow! Could you make your problem reproducible by sharing a sample of your data so others can help (please do not use `str()`, `head()` or screenshot)? You can use the [`reprex`](https://reprex.tidyverse.org/articles/articles/magic-reprex.html) and [`datapasta`](https://cran.r-project.org/web/packages/datapasta/vignettes/how-to-datapasta.html) packages to assist you with that. See also [Help me Help you](https://speakerdeck.com/jennybc/reprex-help-me-help-you?slide=5) & [How to make a great R reproducible example?](https://stackoverflow.com/q/5963269) – Tung Apr 12 '19 at 14:58
  • Hi, i am sorry for just answering now, i took me quite a bit to understand reprex and still you will see that my output is anything but pretty. However I hope that it might give you an idea of what i am trying to do. – Roman Apr 15 '19 at 19:31

1 Answers1

0

Yes, most statistics functions in R return some sort of nested list, usually because their output is more complex than a single data row. The broom package is designed to take the main output of the most common statistical functions (lm etc) and "tidy" them into a data row.

If I understand correctly, you wish to test whether the mean y within levels of x is different from the overall mean of y, using the unpaired wilcox.test (rank-sum test) for comparison of means. I'm not sure that's a meaningful thing to do, since your samples are obviously not independent, but I'll show you how to do it anyway.

Tidying the output of wilcox.test

library(broom)
tidy(wilcox.test(90:100, 94:100, exact = FALSE))

# A tibble: 1 x 4
  statistic p.value method                                            alternative
      <dbl>   <dbl> <chr>                                             <chr>      
1      24.5   0.220 Wilcoxon rank sum test with continuity correction two.sided 

A tibble is a slightly nicer kind of data frame, part of the tidyverse.

How it works: As explained in ?wilcox.test, that function actually returns a list:

> str(wilcox.test(90:100, 94:100, exact = FALSE))

List of 7
 $ statistic  : Named num 24.5
  ..- attr(*, "names")= chr "W"
 $ parameter  : NULL
 $ p.value    : num 0.22
 $ null.value : Named num 0
  ..- attr(*, "names")= chr "location shift"
 $ alternative: chr "two.sided"
 $ method     : chr "Wilcoxon rank sum test with continuity correction"
 $ data.name  : chr "90:100 and 94:100"
 - attr(*, "class")= chr "htest"

The broom package just extracts the main parts which fit nicely in a dataframe row.

Split-apply-combine, the base R way

Now, you want to do this for every unique value of x (Area1, Area2, etc.), then collect the results in a data frame that shows which subset each result is for. There are many ways this can be done in base R; here is one:

# Example data frame (I'm calling it "d" for brevity)
d <- data.frame(x = c("Area1", "Area1", "Area1", "Area2", "Area2"), y = c(1, 2, 3, 2, 3))
# Empty list to hold our results
L <- list()
for (i in unique(d$x)) {
  # Run Wilcoxon test, "tidy" the result, and assign it to element i of L
  L[[i]] <- tidy(wilcox.test(d$y, d[d$x == i, "y"], exact = FALSE))
}
L
# Combine the results
results <- do.call(rbind, L)  # Same as rbind(L[[1]], L[[2]], ...)
# Add a column identifying the subsets
results$area <- names(L)
results

# A tibble: 2 x 5
  statistic p.value method                                            alternative area 
*     <dbl>   <dbl> <chr>                                             <chr>       <chr>
1       8.5   0.875 Wilcoxon rank sum test with continuity correction two.sided   Area1
2       4     0.834 Wilcoxon rank sum test with continuity correction two.sided   Area2

Split-apply-combine, the tidyverse way

The split-apply-combine workflow is very common, yet somewhat cumbersome to implement in base R. The tidyverse/dplyr way is much more concise and, with some practice, easier to read:

library("tidyverse")
library("broom")

d %>% 
  group_by(x) %>% 
  do(wilcox.test(d$y, .$y, exact = FALSE) %>% tidy)

# A tibble: 2 x 5
# Groups:   x [2]
  x     statistic p.value method                                            alternative
  <fct>     <dbl>   <dbl> <chr>                                             <chr>      
1 Area1       8.5   0.875 Wilcoxon rank sum test with continuity correction two.sided  
2 Area2       4     0.834 Wilcoxon rank sum test with continuity correction two.sided 

Notes:

  • The pipe %>% is syntactic sugar: x %>% f(a) means f(x, a). This makes it easier to write nested function calls in a readable manner. Read the manual on pipes for a deeper understanding.
  • Usually in dplyr you refer to data frame columns by unquoted name, not name the data frame. Your case requires d$y to get all y regardless of the group_by, but this is an exception. Again, read the manual.
  • The dot inside do() refers to the current subset of the data frame, so that .$y are the y in the current group.
Jon Olav Vik
  • 1,421
  • 2
  • 12
  • 20
  • First of all, thanks a lot for your comprehensive answer. – Roman Apr 17 '19 at 09:07
  • I'm trying to implement it and will respond soon. Your comment of wheter the wilcoxon-test is appropriate is maybe helpful. I have a dataset with some hundred observations that can be clustered by their the area (x). As far as i see, the distribution of y-values does not follow a normal-distribution or poisson (I tested this by generating a random normal/poisson distribution, testing wheter y and the other distribution are significantly different). So my distribution is probably not a simple one and another test might not be correct? – Roman Apr 17 '19 at 09:22
  • What you should do depends on what your actual research question is. Try discussing over a cup of coffee with any nearby colleague/data analyst/statistician 8-) – Jon Olav Vik Apr 18 '19 at 09:48
  • That's a fair point, I just thought you had something special in mind. Concerning your input, it worked very well. Thanks therefore and also for your explanation. Is it possible to round a resulting p-value to a "minimal" number e.g. 0.000001 to <0,001 since i got some very small numbers and I feel it might be easier to read like that? – Roman Apr 19 '19 at 19:37
  • data.frame(x=c(0.02, 0.000001)) %>% mutate(x = ifelse(x < 0.001, "<0.001", x)) – Jon Olav Vik Apr 21 '19 at 09:59