11

I am to compute multiple quantiles for a certain variable:

> res1 <- aggregate(airquality$Wind, list(airquality$Month), function (x) quantile(x, c(0.9, 0.95, 0.975)))
> head(res1)
  Group.1   x.90%   x.95% x.97.5%
1       5 16.6000 17.5000 18.8250
2       6 14.9000 15.5600 17.3650
3       7 14.3000 14.6000 14.9000
4       8 12.6000 14.0500 14.6000
5       9 14.9600 15.5000 15.8025

The result looks good at first, but aggregate actually returns it in a very strange form, where the last 3 columns are not columns of a data.frame, but a single matrix!

> names(res1)
[1] "Group.1" "x"      
> dim(res1)
[1] 5 2
> class(res1[,2])
[1] "matrix"

This causes a lot of problems in further processing.

Few questions:

  1. Why is aggregate() behaving so strange?
  2. Is there any way to persuade it to make the result I expect?
  3. Or am I perhaps using a wrong function for this purpose? Is there any other prefered way to get the wanted result?

Of course I could do some transformation of the output of aggregate(), but I look for some more simple and straightforward solution.

Tomas
  • 57,621
  • 49
  • 238
  • 373
  • 1
    I agree that this behavior is horribly annoying. Functions should never auto-convert outputs, as it breaks downstream functions. Simplify should instead allow an output type that is explicit (e.g., data.frame, c, matrix). – Adam Erickson Sep 13 '18 at 23:55
  • The second question is also answered here: [Apply several summary functions on several variables by group in one call](https://stackoverflow.com/a/12064297/1851712) – Henrik Oct 22 '18 at 14:24

1 Answers1

15

Q1: Why is the behavior so strange?

This is actually a documented behavior at ?aggregate (though it may still be unexpected). The relevant argument to look at would be simplify.

If simplify is set to FALSE, aggregate would produce a list instead in a case like this.

res2 <- aggregate(airquality$Wind, list(airquality$Month), function (x) 
  quantile(x, c(0.9, 0.95, 0.975)), simplify = FALSE)
str(res2)
# 'data.frame':  5 obs. of  2 variables:
#  $ Group.1: int  5 6 7 8 9
#  $ x      :List of 5
#   ..$ 1  : Named num  16.6 17.5 18.8
#   .. ..- attr(*, "names")= chr  "90%" "95%" "97.5%"
#   ..$ 32 : Named num  14.9 15.6 17.4
#   .. ..- attr(*, "names")= chr  "90%" "95%" "97.5%"
#   ..$ 62 : Named num  14.3 14.6 14.9
#   .. ..- attr(*, "names")= chr  "90%" "95%" "97.5%"
#   ..$ 93 : Named num  12.6 14.1 14.6
#   .. ..- attr(*, "names")= chr  "90%" "95%" "97.5%"
#   ..$ 124: Named num  15 15.5 15.8
#   .. ..- attr(*, "names")= chr  "90%" "95%" "97.5%"

Now, both a matrix and a list as columns may seem to be strange behavior, but I presume it's more of a case of "status by design" rather than a "bug" or a "flaw".

For instance, consider the following: We want to aggregate both the "Wind" and the "Temp" columns from the "airquality" dataset, and we know that each aggregation would result in multiple columns (like we would expect with quantile).

res3 <- aggregate(cbind(Wind, Temp) ~ Month, airquality, 
                  function (x) quantile(x, c(0.9, 0.95, 0.975)))
res3
#   Month Wind.90% Wind.95% Wind.97.5% Temp.90% Temp.95% Temp.97.5%
# 1     5  16.6000  17.5000    18.8250   74.000   77.500     79.500
# 2     6  14.9000  15.5600    17.3650   87.300   91.100     92.275
# 3     7  14.3000  14.6000    14.9000   89.000   91.500     92.000
# 4     8  12.6000  14.0500    14.6000   94.000   95.000     96.250
# 5     9  14.9600  15.5000    15.8025   91.100   92.550     93.000

In some ways, keeping these values as matrix-columns might make sense--the data aggregated data are easily accessible by their original column names:

res3$Temp
#       90%   95%  97.5%
# [1,] 74.0 77.50 79.500
# [2,] 87.3 91.10 92.275
# [3,] 89.0 91.50 92.000
# [4,] 94.0 95.00 96.250
# [5,] 91.1 92.55 93.000

Q2: How do you get the results as separate columns in a data.frame?

But a list as a column is just as awkward to deal with as a matrix as a column in many cases. If you want to "flatten" your matrix into columns, use do.call(data.frame, ...):

do.call(data.frame, res1)
#   Group.1 x.90. x.95. x.97.5.
# 1       5 16.60 17.50 18.8250
# 2       6 14.90 15.56 17.3650
# 3       7 14.30 14.60 14.9000
# 4       8 12.60 14.05 14.6000
# 5       9 14.96 15.50 15.8025
str(.Last.value)
# 'data.frame':  5 obs. of  4 variables:
#  $ Group.1: int  5 6 7 8 9
#  $ x.90.  : num  16.6 14.9 14.3 12.6 15
#  $ x.95.  : num  17.5 15.6 14.6 14.1 15.5
#  $ x.97.5.: num  18.8 17.4 14.9 14.6 15.8a

Q3: Are there other alternatives?

As with most things R, yes of course. My preferred alternative would be to use the "data.table" package, with which you can do:

library(data.table)
as.data.table(airquality)[, as.list(quantile(Wind, c(.9, .95, .975))), 
                          by = Month]
#    Month   90%   95%   97.5%
# 1:     5 16.60 17.50 18.8250
# 2:     6 14.90 15.56 17.3650
# 3:     7 14.30 14.60 14.9000
# 4:     8 12.60 14.05 14.6000
# 5:     9 14.96 15.50 15.8025
str(.Last.value)
# Classes ‘data.table’ and 'data.frame':  5 obs. of  4 variables:
#  $ Month: int  5 6 7 8 9
#  $ 90%  : num  16.6 14.9 14.3 12.6 15
#  $ 95%  : num  17.5 15.6 14.6 14.1 15.5
#  $ 97.5%: num  18.8 17.4 14.9 14.6 15.8
#  - attr(*, ".internal.selfref")=<externalptr> 
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • 1
    Thanks Ananda. The result for `simplify = FALSE` is even more strange and even far from what I want. The trick is to postprocess with `do.call(data.frame, res1)`, which is not very straightforward, but does the job. I find it kind of a flaw of `aggregate` that this output cannot be achieved easier. – Tomas Jan 14 '15 at 14:58
  • @TMS, see my update to Q1, which explains why I don't think it is necessarily a flaw. – A5C1D2H2I1M1N2O1R2T1 Jan 14 '15 at 15:12
  • Ananda, I don't argue that current output is not useful in certain situations; I only say it is a flaw that there is no easy way to get the standard data.frame, which is what you usually want. – Tomas Jan 14 '15 at 15:16
  • @TMS, but doesn't `do.call(data.frame, res3)` flatten everything out pretty easily (even for the "res3" example I shared)? – A5C1D2H2I1M1N2O1R2T1 Jan 14 '15 at 15:22
  • I already told you in my comment that yes, it does. But it feels kind of clumsy to have to do that. – Tomas Jan 14 '15 at 15:28
  • 1
    @TMS, out of curiosity, what would you expect the output for the "data.table" approach to be if I had used `list` instead of `as.list`? Try it. Is it what you expected? Functions don't always work the way we expect them to. At least we should appreciate it when they are documented and indication what you should expect. This is only happening because your aggregation function outputs more than one value. This can sometimes seem contradictory to the idea of an "aggregate". See, for example, [this discussion about "dplyr" functionality](https://github.com/hadley/dplyr/issues/154) in such cases. – A5C1D2H2I1M1N2O1R2T1 Jan 14 '15 at 19:45