
I have a data frame named KWR with 90 observations and 124306 variables, all numeric. I want to run a Kruskal-Wallis test on every column, comparing groups. I appended a column named "Group", holding each observation's group, after my variables. As a sanity check, I tested one peptide (named X2461) with this code:

kruskal.test(X2461 ~ Group, data = KWR)

This worked fine and returned a result instantly. However, I need to analyze all the variables. Following this post, I used lapply: How to loop Bartlett test and Kruskal tests for multiple columns in a dataframe?

cols <- names(KWR)[1:124306]
allKWR <- lapply(cols, function(x) kruskal.test(reformulate("Group", x), data = KWR))

However, after R had been running nonstop for 2 hours, I killed the job. Is there a more efficient way of doing this?

Thanks in advance.

NB: first time poster, beginner in R

  • You can decide to store only the p-values. You can also parallelize your code so that it runs several times faster; for example, if you have 4 cores, you can run the code on 3 of them for up to roughly a 3-fold speedup. – Yacine Hajji Feb 24 '22 at 14:11
  • Still, keep in mind that you are performing more than 100,000 tests! I would suggest you start with, say, the first 100 columns and check how long that takes. Then repeat with the first 1000 columns and check again. This should give you a good estimate of how long the full run of roughly 124k tests will take. Your code itself is fine (apart from the option of parallelizing it, as suggested above). – deschen Feb 24 '22 at 14:18
  • @YacineHajji How would I store only the p-values? Or parallelize the code? Is parallelizing as simple as splitting cols into cols1/cols2/cols3 and assigning 1/3 of the data.frame to each? As I said, I'm new to R; I took a short R course and mainly search Stack Overflow for answers. – Jonas De Leeuw Feb 24 '22 at 14:25
  • You can store only the p-values (which I believe is all you need, given that you have 100K results) by appending `$p.value` to the call: `kruskal.test(reformulate("Group", x), data = KWR)$p.value` – Yacine Hajji Feb 24 '22 at 15:59
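
The suggestions in the comments above (keep only the p-values, time a small subset first, then parallelize) could be sketched roughly as follows. This is a minimal illustration, not a tuned solution: the toy `KWR` data frame, the choice of 50 columns, and the 2 worker processes are all assumptions made for the example, and the actual speedup depends on the machine.

```r
library(parallel)

# Toy stand-in for the real KWR data frame (assumption: 90 rows, 3 groups).
set.seed(42)
KWR <- as.data.frame(matrix(rnorm(90 * 50), nrow = 90))
KWR$Group <- rep(c("A", "B", "C"), each = 30)

# Every column except the grouping column.
cols <- setdiff(names(KWR), "Group")

# Helper (hypothetical name) that returns only the p-value for one column.
kw_p <- function(x) kruskal.test(reformulate("Group", x), data = KWR)$p.value

# Serial version: time it on a small subset to extrapolate to the full run.
timing <- system.time(
  pvals <- vapply(cols, kw_p, numeric(1))
)

# Parallel version: start 2 workers, ship KWR to them, split cols across them.
cl <- makeCluster(2)
clusterExport(cl, "KWR")
pvals_par <- parSapply(cl, cols, kw_p)
stopCluster(cl)
```

Both `pvals` and `pvals_par` are named numeric vectors with one p-value per column. With more than 100,000 tests, the p-values would normally also need a multiple-testing adjustment, for example `p.adjust(pvals, method = "BH")`.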

2 Answers


Take a look at kruskaltests in the Rfast package. For the KWR data.frame, it appears it would be something like:

allKWR <- Rfast::kruskaltests(as.matrix(KWR[,1:124306]), as.numeric(as.factor(KWR$Group)))
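
To illustrate why a vectorized approach can be so much faster than calling kruskal.test once per column, here is a base-R sketch that computes all the p-values with matrix operations. It is not Rfast's actual implementation; `kw_pvalues` is a hypothetical helper name, the simulated data are made up for the example, and the function assumes there are no missing values.

```r
# Vectorized Kruskal-Wallis p-values for every column of a numeric matrix.
# Assumes no NA values; kw_pvalues is a hypothetical helper, not a library call.
kw_pvalues <- function(X, g) {
  X  <- as.matrix(X)
  g  <- factor(g)
  n  <- nrow(X)
  ng <- as.vector(table(g))            # group sizes, in factor-level order

  R <- apply(X, 2, rank)               # within-column ranks (ties get midranks)
  S <- rowsum(R, g)                    # rank sums per group, one row per level

  # H statistic for each column, before tie correction.
  H <- 12 * colSums(S^2 / ng) / (n * (n + 1)) - 3 * (n + 1)

  # Tie correction term, sum(t^3 - t) over tied value counts in each column.
  ties <- apply(X, 2, function(x) {
    t <- tabulate(match(x, x))
    sum(t^3 - t)
  })
  H <- H / (1 - ties / (n^3 - n))

  pchisq(H, df = nlevels(g) - 1, lower.tail = FALSE)
}

# Simulated example: 90 observations, 20 variables, 3 groups, with some ties.
set.seed(1)
X <- matrix(round(rnorm(90 * 20), 1), nrow = 90,
            dimnames = list(NULL, paste0("X", 1:20)))
Group <- rep(c("A", "B", "C"), each = 30)

p_fast <- kw_pvalues(X, Group)
p_ref  <- apply(X, 2, function(x) kruskal.test(x, factor(Group))$p.value)

# Compare the vectorized result against kruskal.test column by column.
all.equal(unname(p_fast), unname(p_ref))
```

The same formula is what kruskal.test applies to a single column; the gain comes from skipping the per-call formula and model-frame overhead.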
jblood94

This worked great: 50 columns and several hundred cases ran in 0.01 s of system time.

Barry DeCicco