summarizing a dataframe and adding column with mean ± SD

Question

I have a dataframe of many variables (soil properties) for 11 legumes in 2 different locations. first few columns of the data is shown below.

 SPECIES   LOCATION   pH   NO3  NH4    P Organic_C      K   Cu    Mn   Zn   BD X.Sand X.Silt X.Clay
1    C. comosa    Gauteng 5.40  8.24 1.35 1.10      0.95  94.40 3.36 84.40 4.72 1.45   68.0     12      9
2    C. comosa    Gauteng 5.25  8.36 1.37 1.20      0.99  94.87 3.39 84.87 4.77 1.36   76.0     16     13
3    C. comosa    Gauteng 5.55  8.19 1.32 1.11      0.94  94.01 3.35 84.01 4.68 1.54   78.0     14     14
4    C. comosa Mpumalanga 5.84  4.05 3.46 3.04      1.55 130.40 0.28 25.43 2.00 1.66   73.6      9     10
5    C. comosa Mpumalanga 5.49  4.45 3.48 3.09      1.53 131.36 0.27 25.35 2.12 1.45   76.5     11     16
6    C. comosa Mpumalanga 6.19  4.43 3.44 3.04      1.58 129.95 0.29 25.45 2.14 1.87   74.9     13     16
7   C. distans    Gauteng 5.48  8.88 1.96 3.33      0.99 130.24 0.99 40.01 3.94 1.55   70.0      8     11
8   C. distans    Gauteng 5.29  8.54 1.99 3.28      0.99 130.28 0.95 40.25 3.89 1.48   79.0     12     15
9   C. distans    Gauteng 5.67  8.63 1.93 3.39      1.02 130.30 0.98 40.12 3.97 1.62   79.0     10     16
10  C. distans Mpumalanga 5.61  6.02 2.65 4.45      2.58 163.25 1.79 53.11 6.11 1.68   72.0      8     10
11  C. distans Mpumalanga 5.43  6.58 2.55 4.49      2.59 163.55 1.78 52.89 6.04 1.63   78.0     15     14
12  C. distans Mpumalanga 5.79  6.24 2.59 4.41      2.59 163.27 1.75 53.03 6.19 1.73   75.0     16     12
13 E. cordatum    Gauteng 4.38 16.29 5.76 4.77      3.25 175.38 1.11 35.87 8.54 1.53   33.0      9     40
14 E. cordatum    Gauteng 4.05 16.15 5.63 4.73      3.29 175.90 1.23 34.34 8.61 1.42   45.0     13     50
15 E. cordatum    Gauteng 4.71 15.89 5.99 4.80      3.25 174.54 1.19 36.44 8.58 1.64   42.0     14     54

for each location, i want to summarize the data such that the soil properties are in the first column while the legumes are the other columns and each value reported as the mean ± SD. something like the table below.

i am thinking dcast but i am not sure how to get the values as mean ± SD

Daniel O · Accepted Answer · 2020-07-07T19:33:30.183

In Base R this gives you the exact output, But you lose the ability to (easily) extract the numbers afterwards. There are better ways of storing the data if you want to do more with it afterwards.

as.data.frame(t(aggregate(. ~ LOCATION, df[,-1], function(x) paste(round(mean(x),2),"±", round(sd(x),2)))))

                      V1             V2
LOCATION         Gauteng     Mpumalanga
pH           5.09 ± 0.57    5.72 ± 0.28
NO3         11.02 ± 3.83     5.29 ± 1.1
NH4          3.03 ± 2.09    3.03 ± 0.47
P            3.08 ± 1.58    3.75 ± 0.76
Organic_C    1.74 ± 1.14    2.07 ± 0.57
K         133.32 ± 35.08 146.96 ± 17.96
Cu           1.84 ± 1.15    1.03 ± 0.82
Mn         53.37 ± 23.39  39.21 ± 15.12
Zn           5.74 ± 2.15     4.1 ± 2.21
BD           1.51 ± 0.09    1.67 ± 0.14
X.Sand     63.33 ± 18.18      75 ± 2.11
X.Silt          12 ± 2.6      12 ± 3.22
X.Clay     24.67 ± 17.99      13 ± 2.76

Edit:

If you want the results for all species/locations you can use

as.data.frame(t(aggregate(. ~ SPECIES + LOCATION, df, function(x) paste(round(mean(x),2),"±", round(sd(x),2)))))

                    V1            V2            V3            V4            V5
SPECIES      C. comosa    C. distans   E. cordatum     C. comosa    C. distans
LOCATION       Gauteng       Gauteng       Gauteng    Mpumalanga    Mpumalanga
pH          5.4 ± 0.15   5.48 ± 0.19   4.38 ± 0.33   5.84 ± 0.35   5.61 ± 0.18
NO3        8.26 ± 0.09   8.68 ± 0.18   16.11 ± 0.2   4.31 ± 0.23   6.28 ± 0.28
NH4        1.35 ± 0.03   1.96 ± 0.03   5.79 ± 0.18   3.46 ± 0.02    2.6 ± 0.05
P          1.14 ± 0.06   3.33 ± 0.06   4.77 ± 0.04   3.06 ± 0.03   4.45 ± 0.04
Organic_C  0.96 ± 0.03      1 ± 0.02   3.26 ± 0.02   1.55 ± 0.03   2.59 ± 0.01
K         94.43 ± 0.43 130.27 ± 0.03 175.27 ± 0.69 130.57 ± 0.72 163.36 ± 0.17
Cu         3.37 ± 0.02   0.97 ± 0.02   1.18 ± 0.06   0.28 ± 0.01   1.77 ± 0.02
Mn        84.43 ± 0.43  40.13 ± 0.12  35.55 ± 1.09  25.41 ± 0.05  53.01 ± 0.11
Zn         4.72 ± 0.05   3.93 ± 0.04   8.58 ± 0.04   2.09 ± 0.08   6.11 ± 0.08
BD         1.45 ± 0.09   1.55 ± 0.07   1.53 ± 0.11   1.66 ± 0.21   1.68 ± 0.05
X.Sand       74 ± 5.29      76 ± 5.2     40 ± 6.24     75 ± 1.45        75 ± 3
X.Silt          14 ± 2        10 ± 2     12 ± 2.65        11 ± 2     13 ± 4.36
X.Clay       12 ± 2.65     14 ± 2.65     48 ± 7.21     14 ± 3.46        12 ± 2

And if you really only want one location you can use.

as.data.frame(t(aggregate(. ~ SPECIES,df[df$LOCATION == "Gauteng",-2], function(x) paste(round(mean(x),2),"±", round(sd(x),2)))))

                    V1            V2            V3
SPECIES      C. comosa    C. distans   E. cordatum
pH          5.4 ± 0.15   5.48 ± 0.19   4.38 ± 0.33
NO3        8.26 ± 0.09   8.68 ± 0.18   16.11 ± 0.2
NH4        1.35 ± 0.03   1.96 ± 0.03   5.79 ± 0.18
P          1.14 ± 0.06   3.33 ± 0.06   4.77 ± 0.04
Organic_C  0.96 ± 0.03      1 ± 0.02   3.26 ± 0.02
K         94.43 ± 0.43 130.27 ± 0.03 175.27 ± 0.69
Cu         3.37 ± 0.02   0.97 ± 0.02   1.18 ± 0.06
Mn        84.43 ± 0.43  40.13 ± 0.12  35.55 ± 1.09
Zn         4.72 ± 0.05   3.93 ± 0.04   8.58 ± 0.04
BD         1.45 ± 0.09   1.55 ± 0.07   1.53 ± 0.11
X.Sand       74 ± 5.29      76 ± 5.2     40 ± 6.24
X.Silt          14 ± 2        10 ± 2     12 ± 2.65
X.Clay       12 ± 2.65     14 ± 2.65     48 ± 7.21

Thank you for that. I was looking to first filter the data for say LOCATION == “Gauteng”, then aggregate the soil properties and SPECIES. From your result, the first column is okay, but I want the other columns to be the SPECIES — ngwoke chukwubuikem, Jul 07 '20 at 19:25
Hello Daniel, Thank you for the assistance. Yes, I want to do more with the data. I want to test for significant difference using ANOVA test ```aov``` and ```TukeyHSD```. That is using the dataframe aggregated only by location (the last dataframe in your answer), i want to test for significance difference in the values of pH (and the other porperties) among the legumes. — ngwoke chukwubuikem, Jul 07 '20 at 23:31
most of the code I provided was just to get it in the exact format you asked for. If you want usable data. all you really need is. `aggregate(. ~ SPECIES, df[df$LOCATION == "Gauteng",-2], mean)` — Daniel O, Jul 08 '20 at 00:16

summarizing a dataframe and adding column with mean ± SD

1 Answers1