I'm trying to loop through a large data frame (5413 columns) and run an ANOVA on each column, but I'm getting an error.

I'd like to have the p-value from each ANOVA written to a new row in a data frame containing the column titles. Given my limited knowledge, though, I'm currently writing the p-value outputs to files that I can parse in bash.

Here's an example layout of the data:

data()
Name, Group, aaaA, aaaE, bbbR, cccD
Apple, Fruit, 1.23, 0.45, 0.3, 1.1
Banana, Fruit, 0.54, 0.12, 2.0, 1.32
Carrot, Vegetable, 0.01, 0.05, 0.45, 0.9
Pear, Fruit, 0.1, 0.2, 0.1, 0.3
Fox, Animal, 1.0, 0.9, 1.2, 0.8
Dog, Animal, 1.2, 1.1, 0.8, 0.7

And here is the output from dput:

structure(list(Name = structure(c(1L, 2L, 3L, 6L, 5L, 4L), .Label = c("Apple", 
"Banana", "Carrot", "Dog", "Fox", "Pear"), class = "factor"), 
    Group = structure(c(2L, 2L, 3L, 2L, 1L, 1L), .Label = c(" Animal", 
    " Fruit", " Vegetable"), class = "factor"), aaaA = c(1.23, 
    0.54, 0.01, 0.1, 1, 1.2), aaaE = c(0.45, 0.12, 0.05, 0.2, 
    0.9, 1.1), bbbR = c(0.3, 2, 0.45, 0.1, 1.2, 0.8), cccD = c(1.1, 
    1.32, 0.9, 0.3, 0.8, 0.7)), class = "data.frame", row.names = c(NA, 
-6L))

To get a successful result for a single column I run:

summary(aov(aaaA ~ Group, data=data))[[1]][["Pr(>F)"]]

I then try to implement that in a loop:

for(i in names(data[3:6])){
out <- summary(aov(i ~ Group, data=data))[[1]][["Pr(>F)"]]
write.csv(out, i)}

Which returns the error:

Error in model.frame.default(formula = i ~ Group, data = test, drop.unused.levels = TRUE) : 
variable lengths differ (found for 'Group')

Can anyone help with getting around the error or implementing a per-column ANOVA?

1 Answer

We can do the following and later get the p values:

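# df is the data frame from the question (called data there)
# Every column except the response, aaaA
to_use <- setdiff(names(df), "aaaA")
# Fit aaaA ~ <column> for each remaining column and summarise each model
lapply(to_use, function(x) summary(do.call(aov, list(as.formula(paste("aaaA", "~", x)),
                                           data = df))))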

This gives you:

[[1]]
            Df Sum Sq Mean Sq
Name         5   1.48   0.296

[[2]]
            Df Sum Sq Mean Sq F value Pr(>F)
Group        2 0.8113  0.4057   1.819  0.304
Residuals    3 0.6689  0.2230               

[[3]]
            Df Sum Sq Mean Sq F value Pr(>F)  
aaaE         1 0.9286  0.9286   6.733 0.0604 .
Residuals    4 0.5516  0.1379                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

[[4]]
            Df Sum Sq Mean Sq F value Pr(>F)
bbbR         1  0.043  0.0430    0.12  0.747
Residuals    4  1.437  0.3593               

[[5]]
            Df Sum Sq Mean Sq F value Pr(>F)
cccD         1 0.1129  0.1129    0.33  0.596
Residuals    4 1.3673  0.3418 
NelsonGon
  • That's perfect, thank you! I can match the rest using bash. – Michael Crichton Jun 12 '19 at 12:32
  • I'm now getting a different error on the working dataset: `Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels` I can't paste the full output from dput because of the character limit (it is some 585k characters). – Michael Crichton Jun 12 '19 at 12:42
  • Oh, that's a somewhat common issue. Do you need to use `factors`? It might be better (not sure) to convert everything to character. The issue is that `aov` requires a factor to have two or more levels. – NelsonGon Jun 12 '19 at 12:43
  • See here for more: https://stackoverflow.com/questions/18171246/error-in-contrasts-when-defining-a-linear-model-in-r – NelsonGon Jun 12 '19 at 12:44
  • Thanks, I'll have a tinker around. Might be those NAs that are causing the issue. – Michael Crichton Jun 12 '19 at 12:47
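
One possible cause of the contrasts error discussed in these comments is a column whose non-missing values all fall in a single group, so `Group` is left with only one level once the `NA` rows are dropped. The sketch below shows one way to spot such columns before fitting. The data here is hypothetical, not the asker's dataset, and `groups_left` and `usable` are illustrative names.

```r
# Hypothetical data: y only has values for one group once NAs are dropped
data <- data.frame(
  Group = factor(c("Fruit", "Fruit", "Animal", "Animal")),
  x = c(1.2, 0.5, 1.0, 1.1),
  y = c(0.4, 0.1, NA, NA)
)

value_cols <- setdiff(names(data), "Group")

# Number of distinct groups left after removing rows where the column is NA
groups_left <- sapply(value_cols, function(col) {
  length(unique(data$Group[!is.na(data[[col]])]))
})

# Keep only the columns that still have at least two groups to compare
usable <- value_cols[groups_left >= 2]
print(groups_left)
print(usable)
```

Looping over `usable` rather than every column would skip the columns that cannot be tested, although other causes of the same error are possible.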