
I use the following data.frame as an example:

d <- data.frame(x=c(1,NA), y=c(2,3))

I'd like to sum the values of y grouped by x. Since no value of x is repeated, I would expect the aggregation to give me the original data.frame back, with NA treated as its own group. Instead, aggregate gives me the following result:

> aggregate(y ~ x, data=d, FUN=sum)
  x y
1 1 2

I've read the documentation about changing the default na.action, but doing so doesn't seem to change the result.

> aggregate(y ~ x, data=d, FUN=sum, na.action=na.pass)
  x y
1 1 2

What is going on? I don't seem to understand what na.pass is doing in this case. Is there an option to accomplish what I want in R? Any help would be greatly appreciated.

Sanias
  • You're saying that you are considering an `NA` value as a grouping value? Do you want the `NA` or not? (It's not clear, because you're also using `na.rm = TRUE` as part of your testing.) – A5C1D2H2I1M1N2O1R2T1 Nov 18 '15 at 15:27
  • Yes, I want NA as a group. – Sanias Nov 18 '15 at 15:31
  • The documentation says "Rows with missing values in any of the by variables will be omitted from the result." If you don't want this, you need to recode your `by` variable or use a different function for aggregation. – Roland Nov 18 '15 at 15:34

1 Answer


aggregate converts its grouping variable to a factor, much as tapply does, and rows whose grouping value is NA are dropped.

But look at what happens to NA values in factor:

factor(c(1, 2, NA))
# [1] 1    2    <NA>
# Levels: 1 2

Note that NA is not among the levels. You can use addNA to keep NA as a level:

addNA(factor(c(1, 2, NA)))
# [1] 1    2    <NA>
# Levels: 1 2 <NA>
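To see why this matters for grouping, here is a minimal sketch (the names f and g are only for illustration) comparing is.na on the two factors. Once NA is a level, the underlying codes are no longer missing, so a step that drops missing grouping values has nothing to drop:

```r
# f: plain factor, NA is a missing value and not a level
f <- factor(c(1, 2, NA))
# g: same data, but NA has been promoted to a level
g <- addNA(f)

levels(f)   # "1" "2"
levels(g)   # "1" "2" NA

# is.na() looks at the integer codes of a factor
is.na(f)    # FALSE FALSE  TRUE
is.na(g)    # FALSE FALSE FALSE
```

This is also why factors with NA as a level should be used with some care: code that relies on is.na to find missing values will no longer see them.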

Thus, you would probably need to do something like:

aggregate(y ~ addNA(x), d, sum)
#   addNA(x) y
# 1        1 2
# 2     <NA> 3

Or something like:

d$x <- addNA(factor(d$x))
str(d)
# 'data.frame': 2 obs. of  2 variables:
#  $ x: Factor w/ 2 levels "1",NA: 1 2
#  $ y: num  2 3
aggregate(y ~ x, d, sum)
#      x y
# 1    1 2
# 2 <NA> 3
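The same idea should carry over to the non-formula interface of aggregate, where the grouping variable is passed in a by list. The following is a sketch under that assumption, starting again from the original d so that it is self-contained; res_plain and res_na are illustrative names:

```r
# Start from the original example data (x is still numeric here)
d <- data.frame(x = c(1, NA), y = c(2, 3))

# Plain grouping variable: the row with x = NA is omitted
res_plain <- aggregate(d["y"], by = list(x = d$x), FUN = sum)
print(res_plain)

# Grouping variable wrapped in addNA(): NA is kept as its own group
res_na <- aggregate(d["y"], by = list(x = addNA(d$x)), FUN = sum)
print(res_na)
```

addNA converts a non-factor argument to a factor first, so there is no need to call factor separately here.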

Alternatively, switch to something like "data.table", which will not just be faster than aggregate, but will also give you more consistent behavior with NA values: there is no need to pay attention to whether you're using the formula method of aggregate or not.

library(data.table)
as.data.table(d)[, sum(y), by = x]
#     x V1
# 1:  1  2
# 2: NA  3
A5C1D2H2I1M1N2O1R2T1
  • Thanks, I appreciate it. I am curious about the functionality of na.pass. The documentation says that it "returns the object unchanged." So why are the NA's seemingly being removed? – Sanias Nov 18 '15 at 17:30
  • @Sanias, that is in reference to the columns being aggregated, not the "by" columns. – A5C1D2H2I1M1N2O1R2T1 Nov 18 '15 at 17:31
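To illustrate the point made in the last comment, here is a small sketch with a hypothetical data frame d2 (not from the question) in which the NA sits in the column being aggregated instead of in the grouping column. There, na.action does make a visible difference:

```r
# Hypothetical data: the NA is in y (aggregated), not in x (grouping)
d2 <- data.frame(x = c(1, 1, 2), y = c(2, NA, 3))

# Default na.action = na.omit: the row with the missing y is dropped
# before summing, so group x = 1 sums to 2
res_omit <- aggregate(y ~ x, data = d2, FUN = sum)
print(res_omit)

# na.action = na.pass: the NA reaches sum(), so group x = 1 becomes NA
res_pass <- aggregate(y ~ x, data = d2, FUN = sum, na.action = na.pass)
print(res_pass)
```

With na.pass, adding na.rm = TRUE to the call would pass it through to sum and recover the total for the non-missing values.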