Performing multiple functions (mean, sd, etc.) on all numeric columns in a dataframe with ddply()

Question

I am not quite an R novice, but I'm trying to teach myself how to use plyr, since in many cases it's much faster than writing endless for loops! However, I've run into a problem that I can't find an answer for here, in the plyr documentation, or anywhere else. At least, I haven't found anything I could recognize as an answer; I won't rule out that it's there and I'm just not seeing it!

I have a dataset of many columns, and I'm looking for a way to perform multiple functions on all of the columns without having to copy the code and just change a single argument. I've successfully found and used numcolwise(sd) to get the standard deviation of every numeric column, which was my first big hurdle. I was not about to type the names of every column in my dataset! Example code with the 'iris' dataset, because my dataset is obnoxious:

library(plyr)
n <- ddply(iris, "Species", numcolwise(sd))  # Calculate the sd for all numeric columns in the dataset

and I get this:

     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa    0.3524897   0.3790644    0.1736640   0.1053856
2 versicolor    0.5161711   0.3137983    0.4699110   0.1977527
3  virginica    0.6358796   0.3224966    0.5518947   0.2746501

And that totally works and does what I wanted. I can even make the column names indicate that they're standard deviation:

colnames(n)[2:5] <- paste(colnames(n)[2:5], ".sd", sep = "")  # append .sd to the numeric column names

And that's all great, and I definitely couldn't do that before yesterday.

Okay, so here's where my problem comes in. I'm trying to be as efficient as possible, and I would rather not just copy and rerun the ddply function and colnames function multiple times to end up with a dataframe for sd, another dataframe for mean, and yet another for se. In addition, assuming I could find a way to supply multiple functions as arguments for numcolwise(), I don't know what I would do about the column names.

I know there are ways to calculate mean, sd, and whatever else using summarize(), and that when you do that, you can specify the names of columns (see Set column name ddply). But I can't figure out how, or if, the method used there with summarize can be used with numcolwise() and multiple function arguments (sd, mean, ...) to get something like this:

     Species Sepal.Length.sd Sepal.Width.sd Petal.Length.sd Petal.Width.sd Sepal.Length.mean Sepal.Width.mean Petal.Length.mean Petal.Width.mean
1     setosa       0.3524897      0.3790644       0.1736640      0.1053856             5.006            3.428             1.462            0.246
2 versicolor       0.5161711      0.3137983       0.4699110      0.1977527             5.936            2.770             4.260            1.326
3  virginica       0.6358796      0.3224966       0.5518947      0.2746501             6.588            2.974             5.552            2.026

Note: I know I could do this using a sort of "brute force" method using join(), because I've done that with other datasets that I needed to smush together. But that seems somewhat inelegant and repetitive, and I'm eventually going to have an even larger dataset to do this with, because right now I'm just working with my pilot data.
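
For reference, the join-style approach described in this note does not have to involve copy-pasting. The following is only a sketch, assuming plyr is installed: it loops over a named list of functions, runs the same ddply() call for each, suffixes the column names, and merges the pieces on Species. The names `stat_funs`, `pieces`, and `combined` are made up for illustration.

```r
library(plyr)

# Hypothetical named list of summary functions; the names become column suffixes.
stat_funs <- list(sd = sd, mean = mean)

# One ddply() call per function, with the numeric columns renamed to <column>.<function>.
pieces <- lapply(names(stat_funs), function(fn_name) {
  out <- ddply(iris, "Species", numcolwise(stat_funs[[fn_name]]))
  is_stat <- names(out) != "Species"
  names(out)[is_stat] <- paste(names(out)[is_stat], fn_name, sep = ".")
  out
})

# Merge the per-function data frames on the grouping column.
combined <- Reduce(function(a, b) merge(a, b, by = "Species"), pieces)
print(combined)
```

Adding another statistic (for example a standard error function) would only mean adding one more entry to the list.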

count · Accepted Answer · 2017-08-15T13:43:56.120

3

It's pretty straightforward with dplyr:

require(dplyr)
iris %>% group_by(Species) %>% summarise_all(funs(mean,sd))
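
Note that funs() has been deprecated since dplyr 0.8.0, and the scoped verbs such as summarise_all() were superseded by across() in dplyr 1.0.0. A rough equivalent for newer versions might look like the sketch below; it assumes dplyr 1.0.0 or later, and `iris_summary` is just an illustrative name.

```r
library(dplyr)

# Assumes dplyr >= 1.0.0, where across() is available.
# The named list supplies the suffixes; .names controls the output column names.
iris_summary <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), list(mean = mean, sd = sd),
                   .names = "{.col}.{.fn}"))

print(iris_summary)
```

The grouping column is excluded from across() automatically. The output columns are grouped by variable (Sepal.Length.mean, Sepal.Length.sd, Sepal.Width.mean, and so on) rather than by statistic.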

edited Aug 15 '17 at 13:43

answered Mar 16 '17 at 14:08


See, I just knew it'd be something gloriously simple that I'd somehow missed with every google search I've tried. Thank you so much! – DirtAndPlantScientist Mar 16 '17 at 14:58
since an update of dplyr that last function is now called `summarise_all()` – Leo Aug 15 '17 at 13:15
Is there a way to do this where only numeric columns are summarized? – milo Jul 18 '18 at 14:40
@milo `summarize_if(is.numeric, funs(mean, sd))` – Kevin Feb 17 '20 at 06:02
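
With dplyr 1.0.0 or later, the numeric-only variant can also be written with across() and where(). This is a sketch under that version assumption; the `Site` column is a made-up character column added only to show that non-numeric columns are skipped.

```r
library(dplyr)

# Hypothetical example data: iris plus a non-numeric column that should be ignored.
iris_extra <- iris
iris_extra$Site <- rep(c("A", "B"), length.out = nrow(iris_extra))

# where(is.numeric) restricts the summary to numeric columns only.
numeric_summary <- iris_extra %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), list(mean = mean, sd = sd),
                   .names = "{.col}.{.fn}"))

print(numeric_summary)
```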
