-1

I've looked for an answer but I still have not found a solution, I'm still new to R. My data frame displays measurement of one ecological trait (relative soil coverage) for ~70 plant species in different conditions: different years, different chemical treatment and presence/absence of greenhouse.

I need to summarize that data into a new dataframe that displays the mean and sd of the trait for each species and for each combination of factors (conditions). I know aggregate or lapply could help but i struggle to combine both grouping by 3 different factors and the multiples species, which implies the need of an "automated" code.

I'm sorry if I missed a post answering my question

Thank you for your patience and help

Edit: here is a reproducible example, hope i do that correctly:

mydata<-structure(list(Year = c(2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 
2010L, 2010L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 
2011L), Replicate = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L), Treatment = structure(c(1L, 2L, 1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A", 
"B"), class = "factor"), Greenhouse = structure(c(2L, 2L, 1L, 
1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L), .Label = c("No", 
"Yes"), class = "factor"), Sp_1 = c(4L, 0L, 2L, 5L, 4L, 0L, 2L, 
5L, 0L, 0L, 4L, 6L, 4L, 0L, 2L, 5L), Sp_2 = c(7L, 0L, 1L, 1L, 
7L, 0L, 1L, 1L, 7L, 0L, 1L, 1L, 6L, 0L, 1L, 1L), Sp_3 = c(8L, 
2L, 2L, 1L, 8L, 2L, 2L, 1L, 10L, 2L, 1L, 1L, 4L, 2L, 2L, 1L)), class = "data.frame", row.names = c(NA, 
-16L))

I only put 3 species in that example but as i've said i have over 70 species so i need something that could select all the species columns ( mydata[,5:75] ? something along those lines) more than c("sp_1","sp_2",..., "sp_70").

I'd like the output to look like this:

Year   Treatment   Greenhouse   Sp_1_mean   Sp_1_sd   Sp_2_mean   Sp_2_sd 
2010   A           Yes          x           x         x           x
2010   A           No           x           x         x           x
2010   B           Yes          x           x         x           x
2010   B           No           x           x         x           x
2011   A           Yes          x           x         x           x
2011   A           No           x           x         x           x
2011   B           Yes          x           x         x           x
2011   B           No           x           x         x           x

Here is a dput() showing what the desired output should look like

    desired_output<-structure(list(Year = c(2010L, 2010L, 2010L, 2010L, 2011L, 2011L, 
2011L, 2011L), Treatment = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 
2L, 2L), .Label = c("A", "B"), class = "factor"), Greenhouse = structure(c(2L, 
1L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    Sp_1_mean = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "x", class = "factor"), 
    Sp_1_sd = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "x", class = "factor"), 
    Sp_2_mean = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "x", class = "factor"), 
    Sp_2_sd = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "x", class = "factor"), 
    Sp_3_mean = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "x", class = "factor"), 
    Sp_3_sd = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "x", class = "factor")), class = "data.frame", row.names = c(NA, 
-8L))

I hope that's clearer! Thanks

mona
  • 41
  • 4
  • Can you add a reproducible sample? If you want to know how to get a reproducible sample, refer this. https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Mohanasundaram Apr 24 '20 at 14:23
  • Hi to better understand your question it would be helpful if you could provide a sample or fictive sample of your datausing dput () .. and also how your desired output would look like as a data frame. – user12256545 Apr 24 '20 at 14:24

1 Answers1

0

With data.table, you can do something in this flavor:

library(data.table)
setDT(df)
df[, lapply(.SD, function(x) return(c(mean(x), sd(x))),
                 by = c("col1","col2"),
                .SDcols = c("x1","x2")]

(without reproducible example it is hard to give you more precise syntax)

This means: apply mean and sd for each subset of data (here columns x1 and x2) by groups (col1 and col2)

Example

library(data.table)
df <- as.data.table(mtcars)
output <- df[, lapply(.SD, function(x) return(c(mean(x, na.rm = TRUE), sd(x, na.rm = TRUE)))),
            .SDcols = c("disp","drat"),
            by = c("cyl","gear")]
output[, 'stat' := c("mean","sd"), by = c("cyl","gear")]

output
    cyl gear       disp       drat stat
 1:   6    4 163.800000 3.91000000 mean
 2:   6    4   4.387862 0.01154701   sd
 3:   4    4 102.625000 4.11000000 mean
 4:   4    4  30.742699 0.37156042   sd
 5:   6    3 241.500000 2.92000000 mean
 6:   6    3  23.334524 0.22627417   sd
 7:   8    3 357.616667 3.12083333 mean
 8:   8    3  71.823494 0.23027487   sd
 9:   4    3 120.100000 3.70000000 mean
10:   4    3         NA         NA   sd
11:   4    5 107.700000 4.10000000 mean
12:   4    5  17.819091 0.46669048   sd
13:   8    5 326.000000 3.88000000 mean
14:   8    5  35.355339 0.48083261   sd
15:   6    5 145.000000 3.62000000 mean
16:   6    5         NA         NA   sd

Here, I had a column to know which statistics each row is about

Edit with your reproducible example

setDT(mydata)
output <- mydata[, lapply(.SD, function(x) return(c(mean(x, na.rm = TRUE), sd(x, na.rm = TRUE)))),
       .SDcols = c("Sp_1", "Sp_2", "Sp_3"),
       by = c("Year", "Treatment", "Greenhouse")
       ]
output[, 'stat' := c('mean','sd') ,
       by = c("Year", "Treatment", "Greenhouse")]

Since you are interested in a wide format, you can use dcast to reshape your data.

output <- dcast(output, Year + Treatment + Greenhouse ~  ...,
      value.var = c("Sp_1", "Sp_2", "Sp_3"))

output

Year Treatment Greenhouse Sp_1_mean   Sp_1_sd Sp_2_mean   Sp_2_sd Sp_3_mean   Sp_3_sd
1: 2010         A         No       2.0 0.0000000       1.0 0.0000000       2.0 0.0000000
2: 2010         A        Yes       4.0 0.0000000       7.0 0.0000000       8.0 0.0000000
3: 2010         B         No       5.0 0.0000000       1.0 0.0000000       1.0 0.0000000
4: 2010         B        Yes       0.0 0.0000000       0.0 0.0000000       2.0 0.0000000
5: 2011         A         No       3.0 1.4142136       1.0 0.0000000       1.5 0.7071068
6: 2011         A        Yes       2.0 2.8284271       6.5 0.7071068       7.0 4.2426407
7: 2011         B         No       5.5 0.7071068       1.0 0.0000000       1.0 0.0000000
8: 2011         B        Yes       0.0 0.0000000       0.0 0.0000000       2.0 0.0000000

This long to wide conversion could be avoided with a slight modification of the aggregation.

linog
  • 5,786
  • 3
  • 14
  • 28
  • Thank you, I added a reproducible example and the desired output and more information in the initial post – mona Apr 24 '20 at 15:08
  • Ok, I updated the answer. If you find it satisfactory, you can accept it ([guidelines here if you don't know how to do](https://stackoverflow.com/help/someone-answers) – linog Apr 24 '20 at 15:26
  • In your 70 variables example, you should use `paste0("Sp_", 1:70)` instead of `c("Sp_1", "Sp_2", "Sp_3")` – linog Apr 24 '20 at 15:35
  • Thank you very much that worked perfectly. However, in my data, species names are not "sp_1" but the latin names. Do you know how i can write the `dcast` line? Do i have to create a vector with all the species names and put it in `value.var` or is there an easier way? – mona Apr 24 '20 at 17:30
  • There's probably an easier solution but I did not find it when trying your example. You can create the vector using the column names in `mydata` (excuding those you use to group data), and re-use it at the different steps of the solution – linog Apr 24 '20 at 18:02