2

I would like to aggregate a data frame while also adding in a new column (N) that counts the number of rows per value of the grouping variable, in base R.

This is trivial in dplyr:

library(dplyr)
data(iris)

combined_summary <- iris %>% group_by(Species) %>% group_by(N=n(), add=TRUE) %>% summarize_all(mean)

> combined_summary
# A tibble: 3 x 6
# Groups:   Species [3]
  Species        N Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>      <int>        <dbl>       <dbl>        <dbl>       <dbl>
1 setosa        50         5.01        3.43         1.46       0.246
2 versicolor    50         5.94        2.77         4.26       1.33 
3 virginica     50         6.59        2.97         5.55       2.03 

I am however in the unfortunate position of having to write this code in an environment that doesn't allow for packages to be used (don't ask; it's not my decision). So I need a way to do this in base R.

I can do it in base R in a long-winded way as follows:

# First create the aggregated tables separately
summary_means <- aggregate(. ~ Species, data=iris, FUN=mean)
summary_count <- aggregate(Sepal.Length ~ Species, data=iris[, c("Species", "Sepal.Length")], FUN=length)

> summary_means
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

> summary_count
     Species Sepal.Length
1     setosa           50
2 versicolor           50
3  virginica           50

# Then rename the count column
colnames(summary_count)[2] <- "N"

> summary_count
     Species  N
1     setosa 50
2 versicolor 50
3  virginica 50

# Finally merge the two dataframes
combined_summary_baseR <- merge(x=summary_count, y=summary_means, by="Species", all.x=TRUE)

> combined_summary_baseR
     Species  N Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa 50        5.006       3.428        1.462       0.246
2 versicolor 50        5.936       2.770        4.260       1.326
3  virginica 50        6.588       2.974        5.552       2.026

Is there any way to do this in a more efficient way in base R?

Boops Boops
  • 713
  • 5
  • 8
  • 2
    Not actually an improvement but turning your step-by-step solution into one liner, `merge(aggregate(. ~ Species, iris, mean), setNames(aggregate(Sepal.Length ~ Species, data=iris, FUN=length), c("Species", "N")), by = "Species")` – Ronak Shah Dec 20 '19 at 00:24
  • 2
    [This answer](https://stackoverflow.com/questions/12064202/apply-several-summary-functions-on-several-variables-by-group-in-one-call) might help – R. Schifini Dec 20 '19 at 00:30

1 Answers1

2

Here is a base R option using a single by call (to aggregate)

do.call(rbind, by(
    iris[-ncol(iris)], iris[ncol(iris)], function(x) c(N = nrow(x), colMeans(x))))
#            N Sepal.Length Sepal.Width Petal.Length Petal.Width
#setosa     50        5.006       3.428        1.462       0.246
#versicolor 50        5.936       2.770        4.260       1.326
#virginica  50        6.588       2.974        5.552       2.026

Using colMeans ensures that the column names are carried through which avoids an additional setNames call.


Update

In response to your comment, to have row names as a separate column requires an extra step.

d <- do.call(rbind, by(
    iris[-ncol(iris)], iris[ncol(iris)], function(x) c(N = nrow(x), colMeans(x))))
cbind(Species = rownames(d), as.data.frame(d))

Not as concise as the initial by call. I think we're having a clash of philosophies here. In dplyr (and the tidyverse) row names are generally avoided, to be consistent with the principles of "tidy data". In base R row names are common and are (more or less) consistently carried through data operations. So in a way you're asking for a mix of dplyr (tidy) and base R data structure concepts which may not be the best/robust approach.

Maurits Evers
  • 49,617
  • 4
  • 47
  • 68
  • Thank you for this! However this seems to convert the "Species" column into rownames. Is there a way to do this that keeps Species as a regular column? – Boops Boops Dec 20 '19 at 18:46
  • You are right; I am essentially trying to create "tidy" data within the constraints of base R. Not the ideal situation, but I'm trying to manage. The reason why I can't rely on row names is because I need to have a name for that "column". I have to output several such summary tables using different factor variables in our database, not just one Species column like we're doing here. – Boops Boops Dec 21 '19 at 21:43
  • Your latest update does add a column for Species, but the row names are still present, so I'll remove the row names before outputting. Although... since that's yet an extra step, I'm thinking Ronak Shah's suggested one-liner above might be more concise now. – Boops Boops Dec 21 '19 at 21:43
  • @BoopsBoops Just leave the row names as they are. They won't do any harm, so why remove them as long as you have the additional row name column. Ronak's answer is indeed more compact but has two `aggregate` calls instead of one (`by` call). I agree with you in that the situation is not ideal, which is why I am stressing that a more robust/canonical way would be to work within *one* framework (base R: with row names; `dplyr`: row names as column). – Maurits Evers Dec 21 '19 at 23:13