
Here's a simple data frame with a missing value:

M = data.frame( Name = c('name', 'name'), Col1 = c(NA, 1) , Col2 = c(1, 1))
#   Name Col1 Col2
# 1 name   NA    1
# 2 name    1    1

When I use aggregate to sum variables by group ('Name') using the formula method:

aggregate(. ~ Name, M, FUN = sum, na.rm = TRUE)

the result is:

#   Name Col1 Col2
# 1 name    1    1

So the entire first row, which has an NA, is ignored. But if I use the "non-formula" specification:

aggregate(M[, 2:3], by = list(M$Name), FUN = sum, na.rm = TRUE)

the result is:

# Group.1 Col1 Col2
#    name    1    2

Here only the (1,1) entry is ignored.

This caused a major debugging headache in some of my code, since I thought these two calls were equivalent. Is there a good reason why the formula method treats missing values differently?

starball
Ryan Walker

2 Answers

Good question, but in my opinion, this shouldn't have caused a major debugging headache because it is documented quite clearly in multiple places in the manual page for aggregate.

First, in the usage section:

## S3 method for class 'formula'
aggregate(formula, data, FUN, ...,
          subset, na.action = na.omit)

Later, in the description:

na.action: a function which indicates what should happen when the data contain NA values. The default is to ignore missing values in the given variables.
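To make the effect of that default concrete, here is a minimal sketch using the data frame from the question. The names M_complete, formula_result and manual_result are only illustrative. It shows that na.omit drops every row containing an NA before FUN is ever called, which is why na.rm = TRUE has nothing left to remove in the formula call:

```r
# Same data frame as in the question.
M <- data.frame(Name = c("name", "name"), Col1 = c(NA, 1), Col2 = c(1, 1))

# na.omit() removes whole rows that contain at least one NA,
# so only the second row survives here.
M_complete <- na.omit(M)
print(M_complete)

# The formula method applies na.omit by default before calling FUN.
formula_result <- aggregate(. ~ Name, M, FUN = sum, na.rm = TRUE)

# Aggregating the already-filtered rows by hand gives the same sums.
manual_result <- aggregate(M_complete[, 2:3],
                           by = list(Name = M_complete$Name),
                           FUN = sum)

print(formula_result)
print(manual_result)
```

Both results should report Col2 as 1 rather than 2, because the first row's Col2 value is discarded along with the NA in Col1.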


I can't answer why the formula method was written differently (that's something the function authors would have to answer), but given the above, you can probably override na.action so that no rows are dropped before FUN is applied:

aggregate(.~Name, M, FUN=sum, na.rm=TRUE, na.action=NULL)
#   Name Col1 Col2
# 1 name    1    2
A5C1D2H2I1M1N2O1R2T1
  • 18
    -1 for the first sentence (sure it looks easy now that you know exactly what you're looking for, but this would be smth quite non-trivial to find irl) – eddi May 30 '13 at 19:57
  • 6
    @eddi, no problem. I know from your chat and comment histories that you like functions to work like you want them to rather than how they are documented, and you are entirely open to that opinion. – A5C1D2H2I1M1N2O1R2T1 May 30 '13 at 20:00
  • 7
    @eddi -- Really, a downvote for that?? I think Ananda makes a worthwhile point there... Carefully reading the help docs, sooner rather than later, is a very good habit to learn, and will save many headaches down the road! – Josh O'Brien May 30 '13 at 20:00
  • 2
    @AnandaMahto - haha, rather I like functions to be consistent across different use cases; but I elaborated more on the -1 above - it has more to do with you thinking that this is easy to find, just because there is mention of this (again, inconsistent) behavior in the manual – eddi May 30 '13 at 20:05
  • 8
    @eddi -- Sounds like you'd *actually* like to downvote the author of `aggregate.formula` ;) But, given that methods sometimes do use inconsistent defaults, where else than the manual *should* they be documented? The positive value of Ananda's comment is that it reminds the OP (and others) that, in this inconsistent world of ours, **reading the manual saves headaches**! – Josh O'Brien May 30 '13 at 20:14
  • 3
    @JoshO'Brien I really would :) And Anando got it for endorsing their bad behavior. The reason I downvoted this answer is because while it is true that "reading the manual *can* save headaches", I have a hard time imagining how it would here. The way you become aware of this particular issue is likely through pain and not through reading the manual. You *can* use the manual later to confirm the source of your pain of course, but then that manual should be regarded as a badly behaving child rather than some sort of a bible to be put on a pedestal. /end of nonsensical comparisons – eddi May 30 '13 at 20:23
  • 1
    *Ananda, sorry for misspelling – eddi May 30 '13 at 20:29
  • 4
    FWIW, when _I_ read the documentation quoted, I would interpret that to mean that just the NA values are removed, not entire rows where there are _any_ NAs. Perhaps a more experienced R user would find it obvious, but I did not. All that would really be necessary to say is to use `na.action=na.pass`. That was the solution I was looking for (in a similar situation to the asker). – big_m Feb 20 '16 at 22:28
  • 1
    May I just add that the documentation is not so good? I am just arriving to this AFTER reading it. It is clear what the function is, but what are the options? na.action = na.omit returns to me "invalid 'type' (closure) of argument". Is there anywhere with a proper documentation about aggregate or na.omit that explains well its use? Would be very grateful for any leads... – Pladiona Nov 26 '21 at 15:01
22

If you want the formula version to be equivalent, pass na.action = na.pass so that rows with NA reach FUN, and let FUN drop the NA values itself:

M = data.frame( Name = rep('name',5), Col1 = c(NA,rep(1,4)) , Col2 = rep(1,5))
aggregate(. ~ Name, M, function(x) sum(x, na.rm=TRUE), na.action = na.pass)
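
As a quick check, the sketch below compares that formula call against the non-formula call from the question on the same five-row data. The names by_formula and by_list are only illustrative, and the grouping column is named Name in the list so that both results use the same column names:

```r
M <- data.frame(Name = rep("name", 5), Col1 = c(NA, rep(1, 4)), Col2 = rep(1, 5))

# Formula method: na.pass keeps rows with NA, and FUN removes the NA values.
by_formula <- aggregate(. ~ Name, M, function(x) sum(x, na.rm = TRUE),
                        na.action = na.pass)

# Non-formula method: na.rm = TRUE is forwarded to sum().
by_list <- aggregate(M[, 2:3], by = list(Name = M$Name), FUN = sum, na.rm = TRUE)

# The aggregated sums should agree column by column.
print(all.equal(by_formula$Col1, by_list$Col1))
print(all.equal(by_formula$Col2, by_list$Col2))
```

With this data, both calls should give 4 for Col1 (the NA is skipped) and 5 for Col2 (no row is dropped).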
Rorschach