
I'm completely new to programming and R, but have a dataset that can only be analyzed with a more powerful statistics program such as R.

I have a large but simple dataset consisting of thousands of different groups, each with multiple samples, and I want to compare every group against the control group with a Mann-Whitney U test. The data structure is shown below.

Group, Measurements
a      0.14534
cont   0.42574
d      0.36347
c      0.14284
a      0.23593
d      0.36347
cont   0.33514
cont   0.29210
b      0.36345
...

The problem is that the test compares exactly two groups at a time, so it fails when my input contains more than one non-control group.

This is what I have so far. As you can see, it does not repeat the test for each group, and it only works if my input file contains exactly two groups.

data1 = read.csv(file.choose(), header=TRUE, stringsAsFactors=FALSE)
attach(data1)
testoutput <- wilcox.test(measurement ~ group, mu=0, alt="two.sided", conf.int=TRUE, conf.level=0.95, paired=FALSE, exact=FALSE, correct=TRUE)
write.table(testoutput$p.value, file="mwUtest.tsv", sep="\t")

How do I write and loop the test properly so that it tests all my groups against my designated control group? I assume the sapply or lapply functions are used around wilcox.test, but I don't know how.

I'm sorry if this simple question has been brought up before, but I could not find any previous question regarding this specific problem.

talat
  • Looks like you have a comma as a decimal separator - if so, add `dec = ","` to your `read.csv` call. Try `pairwise.wilcox.test` if you want pairwise Wilcoxon tests, or `kruskal.test`. – Richard Telford Apr 17 '16 at 12:58
  • Ah sorry, I just made up those numbers on the go, and I usually use commas as decimal separators as I'm a filthy euro. The data in the sheet is formatted correctly, and I edited my post to reflect this. Thanks though! – m.andersson Apr 17 '16 at 13:09
  • I understand you're new to R, but word of advice from someone who's been there: don't use attach. Don't get in the habit of using it, because it'll clutter your environment and can lead to weird issues/errors that are hard to debug. – Heroka Apr 17 '16 at 13:36
  • @Heroka This is indeed probably wise for newbies. Personally, I blindly used attach because the initial tutorials I followed did so without explaining it, and since I did not know what attach does, it actually produced some errors while I tried various commands on the data and similar test sets. Now that I have read up on what it does, I can safely exclude it from my analysis. – m.andersson Apr 18 '16 at 16:13
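
As a minimal sketch of the advice about attach in the comments above: the formula interface of wilcox.test accepts a data argument, so the columns can be looked up in the data frame directly. The data frame below is randomly generated for illustration only and is restricted to one group plus the control, since the formula interface requires exactly two groups.

```r
set.seed(1)
# Hypothetical stand-in for the data read from the CSV file (random values)
data1 <- data.frame(group = rep(c("a", "cont"), each = 10),
                    measurement = runif(20),
                    stringsAsFactors = FALSE)

# Columns are found via the data argument, so attach() is not needed
testoutput <- wilcox.test(measurement ~ group, data = data1,
                          alternative = "two.sided", exact = FALSE, correct = TRUE)
testoutput$p.value
```

With more than two groups in data1, this call stops with an error, which is why the test has to be run once per group.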

1 Answer


In R, there are often many solutions to the same problem. Here's how I would solve this.

First, I would split my data and have one dataframe with experiments and one with controls:

experiments <- dat[dat$group!="cont",]
controls <- dat[dat$group=="cont",]

Then I would split my experimental data by group, and feed that to my test together with my control measurements. Note that this construction makes it easy to extract more values from the test: just return a (named) vector.

result <- lapply(split(experiments, experiments$group), function(x) {
  # Compare this group's measurements against the controls (unpaired, two-sided)
  mytest <- wilcox.test(x$measurement, controls$measurement, mu = 0,
                        alternative = "two.sided", conf.int = TRUE, conf.level = 0.95,
                        paired = FALSE, exact = FALSE, correct = TRUE)
  return(mytest$p.value)
})

Combining to a table is then easy:

output <- do.call(rbind,result)
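
To illustrate the remark about returning a (named) vector, here is a sketch that extracts several values per test instead of only the p-value. It uses the same simulated data as this answer; the names W, estimate, conf.low and conf.high are my own labels, not something wilcox.test prescribes, and the output file name is taken from the question.

```r
set.seed(123)
nobs <- 100
dat <- data.frame(group = sample(c(LETTERS[1:6], "cont"), nobs, replace = TRUE),
                  measurement = runif(nobs), stringsAsFactors = FALSE)

experiments <- dat[dat$group != "cont", ]
controls <- dat[dat$group == "cont", ]

result <- lapply(split(experiments, experiments$group), function(x) {
  mytest <- wilcox.test(x$measurement, controls$measurement,
                        alternative = "two.sided", conf.int = TRUE, conf.level = 0.95,
                        paired = FALSE, exact = FALSE, correct = TRUE)
  # Named vector: the names become the column names after rbind
  c(W = unname(mytest$statistic),
    p.value = mytest$p.value,
    estimate = unname(mytest$estimate),
    conf.low = mytest$conf.int[1],
    conf.high = mytest$conf.int[2])
})

# One row per group, one column per extracted value
output <- do.call(rbind, result)
write.table(output, file = "mwUtest.tsv", sep = "\t", quote = FALSE, col.names = NA)
```

Because every element of result has the same names, rbind lines them up into a matrix whose row names are the group labels.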

Data used:

set.seed(123)
nobs <- 100
dat <- data.frame(group = sample(c(LETTERS[1:6], "cont"), nobs, replace = TRUE),
                  measurement = runif(nobs), stringsAsFactors = FALSE)
Heroka
  • Thanks a lot for the nice answer! Your solution worked perfectly for my problem and was exactly what I was looking for. I did have some trouble testing it, as the test didn't like empty entries, but that was surprisingly easy to fix in R even for a total newbie. For other R newbies with the same problem, this is how: http://stackoverflow.com/questions/4862178/remove-rows-with-nas-in-data-frame – m.andersson Apr 18 '16 at 15:53
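
For the empty entries mentioned in the comment above, one possible fix (a sketch, assuming the empty cells were read in as NA) is to drop incomplete rows before splitting the data. The small data frame here is made up purely for illustration.

```r
# Made-up example with missing values in both columns
dat <- data.frame(group = c("a", "cont", "b", NA, "cont"),
                  measurement = c(0.1, NA, 0.3, 0.4, 0.5),
                  stringsAsFactors = FALSE)

# Keep only the rows where every column has a value
dat_clean <- dat[complete.cases(dat), ]
nrow(dat_clean)
```

na.omit(dat) gives the same rows; complete.cases is used here only because it makes the row filter explicit.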