Wilcoxon Test in a loop for a number of datasets at the same time

Question

I have a question about whether I can do a Wilcoxon test in a loop for all the table generated.

Basically, I want to do a paired Wilcoxon test between 2 variables for each dataset, and the 2 variables are in the same position(like xth and yth column) for every dataset. (For people who are familiar with Biology, in fact this is the RPKM values for like between control and treated sample for some repetitive elements) And I hope I can generate a table for the p-value from Wilcoxon test for each dataset.

I ready generated all the tables/dataset/dataframe using the below code and I think I want to do a Wilcoxon test for each dataset so I think I need to continue with the loop but i don't know how to do it:

data=sample_vs_norm
filter=unique(data$family)

for(i in 1:length(filter)){
  table_name=paste('table_', filter[i], sep="")
  print(table_name)
  assign(table_name, data[data$Subfamily == filter[i]])

here is the structure of a single dataset: so basically i would like to do a Wilcoxon test between the variables "R009_initial_filter_rpkm" and "normal_filter_rpkm"

 Chr     Start       End Mappability Strand R009_initial_filter_NormalizedCounts
1: chr11 113086868 113087173           1      -                                        2
2:  chr2  24290845  24291132           1      -                                       11
3:  chr4  15854425  15854650           1      -                                        0
4:  chr6  43489623  43489676           1      +                                       11
   normal_filter_NormalizedCounts R009_initial_filter_rpkm
1:                                         14.569000                     0.169752
2:                                          1.000000                     0.992191
3:                                         14.815900                     0.000000
4:                                          0.864262                     5.372810
   normal_filter_rpkm FoldChange     p.value         FDR FoldChangeFPKM
1:                              1.236560   0.137278 0.999862671 1.000000000      0.1372776
2:                              0.000000  11.000000 0.003173828 0.008149271            Inf
3:                              1.704630   0.000000 1.000000000 1.000000000      0.0000000
4:                              0.422137  12.727600 0.003173828 0.008149271     12.7276453

structure(list(Chr = structure(1:4, .Label = c("chr11", "chr2", 
"chr4", "chr6"), class = "factor"), Start = c(113086868L, 24290845L, 
15854425L, 43489623L), End = c(113087173L, 24291132L, 15854650L, 
43489676L), Mappability = c(1L, 1L, 1L, 1L), Strand = structure(c(1L, 
1L, 1L, 2L), .Label = c("-", "+"), class = "factor"), R009_initial_filter_NormalizedCounts = c(2L, 
11L, 0L, 11L), normal_filter_NormalizedCounts = c(14.569, 
1, 14.8159, 0.864262), R009_initial_filter_rpkm = c(0.169752, 
0.992191, 0, 5.37281), normal_filter_rpkm = c(1.23656, 
0, 1.70463, 0.422137), FoldChange = c(0.137278, 11, 0, 12.7276
), p.value = c(0.999862671, 0.003173828, 1, 0.003173828), FDR = c(1, 
0.008149271, 1, 0.008149271), FoldChangeFPKM = c(0.1372776, Inf, 
0, 12.7276453), class = "data.frame", row.names = c(NA, 
-4L))

I'm sorry if I use incorrect terminology as I am a newbie in R, and thank you so much for the help

Please edit your post to make a Minimal Reproducible Example. See https://stackoverflow.com/questions/5963269 . You _must_ include some sample data; 4-10 lines is usually enough. Include a sample of what you want the output to look like. REALIZE: Virtually nobody in your audience knows or cares about RPKM or WIlcox tests, and we don't want to do the reading that it would take to understand them. We want help you on how to iterate over many single tests. — David T, May 27 '20 at 02:52
Wrap all your Wilcox code in a single function and make sure it does one test correctly.Then edit your post to ask us how to iterate through all the cases. — David T, May 27 '20 at 02:53
Hi Kwok, welcome to Stack Overflow. I agree with David that your question should be condensed into something shorter, and it is much easier to help if we have a reproducible example. However, I disagree that introducing problem is useless. The [how to ask a good question](https://stackoverflow.com/help/how-to-ask) help page even talks about an introduction. — Ian Campbell, May 27 '20 at 04:00
Thank you so much for the comment, im sorry about the set up of the question because im really new to the coding/R etc as i usually do wet lab before and i don't quite know how to communicate in the language of programmers. I will try to fix it if i can... thx — Kwok Yu LIU, May 27 '20 at 04:04
I have made a change to the question, feel free to give any comment/suggestion/critiques so that i can learn from u guys, thank you so much !! — Kwok Yu LIU, May 27 '20 at 04:09
Based on the formatting of the row names, are you using the `data.table` package? Do you want to perform the `wilcox.test` for all possible `TE_Subfamily`? Or is there a particularly set you are interested in? If so, please provide the output of `dput(filter)`. — Ian Campbell, May 27 '20 at 04:19
I think i am using data.table, as you have said, i want to do the test for all TE_Subfamily such as the one u see (AluYf4), and there are 1000+ TE_Subfamily, so now i basically have 1000+ data tables, each belongs to one subfamily, i add the codes at the very beginning to show how i generate all the data tables — Kwok Yu LIU, May 27 '20 at 05:49

score 1 · Accepted Answer · edited Jun 25 '20 at 02:51

One approach is to use grouping with by = in data.table.

library(data.table)
setDT(data)
data[,wilcox.test(R009_initial_filter_rpkm,
                  normal_filter_rpkm)[c("statistic","p.value")],
     by = TE_Subfamily]
#   TE_Subfamily statistic p.value
#1:       AluYf4       7.5       1

You can group by any number of variables, for example TE_Subfamily and Chr:

data[TE_Subfamily %in% filter,
     wilcox.test(R009_initial_filter_rpkm,
                  normal_filter_rpkm)[c("statistic","p.value")],
     by = .(TE_Subfamily,Chr)]
#   TE_Subfamily   Chr statistic p.value
#1:       AluYf4 chr11         0       1
#2:       AluYf4  chr2         1       1
#3:       AluYf4  chr4         0       1
#4:       AluYf4  chr6         1       1

If you need to only perform comparisons for certain TE_Subfamily, you could try something like this:

filter <- c("AluYf4")
data[TE_Subfamily %in% filter,
     wilcox.test(R009_initial_filter_rpkm,
                  normal_filter_rpkm)[c("statistic","p.value")],
     by = TE_Subfamily]
#   TE_Subfamily statistic p.value
#1:       AluYf4       7.5       1

For bonus points, you can correct for multiple testing:

data[TE_Subfamily %in% filter,
     wilcox.test(R009_initial_filter_rpkm,
                  normal_filter_rpkm)[c("statistic","p.value")],
     by = TE_Subfamily][,adjusted.p.value := p.adjust(p.value,method = "bonferroni")][]

Wilcoxon Test in a loop for a number of datasets at the same time

1 Answers1