
I aggregate data containing NAs and therefore I include na.action = NULL as explained here. Here is the code that works:

# Toy data.
df <- data.frame(x= 1:10, group= rep(1:2, 5), other_var= rnorm(10))
# Aggregate with a formula (passed unnamed, because the name of this argument differs across R versions).
aggregate(x ~ group, data= df, na.action= NULL, FUN= function(i) sum(i))

In my situation I cannot hard-code the variable names in a formula because they can change. Instead, I supply them as a character vector and pass the columns through the x and by arguments, like this:

var_names <- c("x", "group")
aggregate(x= df[ , var_names[1]],  by= list(df[ , var_names[2]]), na.action= NULL, FUN= function(i) sum(i))

This results in an error. Interestingly, leaving out na.action= NULL, e.g. aggregate(x= df[ , var_names[1]], by= list(df[ , var_names[2]]), FUN= function(i) sum(i)), does not raise an error and returns the expected output. How can I keep rows containing NAs from disappearing while providing the column names as a vector? I do need to include na.action= NULL because my real data contains NAs.

  • Study the documentation. Only `aggregate.formula` has an `na.action` argument. – Roland Dec 14 '21 at 14:17
  • I find your example confusing. The linked post describes a different situation, namely if you have `.` on the LHS of the formula. Since you don't have that, I don't understand why you fiddle with the `na.action` argument at all. – Roland Dec 14 '21 at 14:34
  • @Roland You are right, my example data is not good. It does not even contain NAs. I just made data to reproduce the error. Of course my real data looks different. –  Dec 14 '21 at 14:57

3 Answers


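You don't have to spell out the column names in aggregate.formula; you can index the columns by position instead.
Setting na.action = na.pass keeps the rows that contain NAs, which should cover your na.action requirements; na.rm = TRUE is then passed on to sum.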

# Aggregate columns 1 and 3 by column 2, then restore the original names.
setNames(
   aggregate( cbind(df[,1], df[,3]) ~ df[,2], df, sum, na.rm=TRUE,
   na.action=na.pass ), colnames(df[,c(2,1,3)]) )
  group  x  other_var
1     1 25 -0.7313815
2     2 30  0.3231317

Data

(I added NAs)

df <- structure(list(x = 1:10, group = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 
2L, 1L, 2L), other_var = c(-1.79458090358371, 0.295106071151792, 
NA, -0.589487588239041, 0.325944874015228, NA, 0.737254570399201, 
0.47849317537615, NA, 0.139020009150021)), row.names = c(NA, 
-10L), class = "data.frame")
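If the column names arrive as strings, one possible alternative (a sketch, not part of the answer above) is the data-frame method of aggregate: indexing with single brackets keeps both the value and the grouping columns as named lists, so the result keeps the original names without setNames. The names value_cols, group_cols and res are hypothetical and introduced only for illustration.

```r
# Same data as above (with NAs in other_var), written compactly.
df <- data.frame(
  x = 1:10,
  group = rep(1:2, 5),
  other_var = c(-1.79458090358371, 0.295106071151792, NA, -0.589487588239041,
                0.325944874015228, NA, 0.737254570399201, 0.47849317537615,
                NA, 0.139020009150021)
)

value_cols <- c("x", "other_var")  # hypothetical: columns to aggregate
group_cols <- "group"              # hypothetical: grouping column(s)

# df[value_cols] and df[group_cols] are data frames (named lists), so the
# column names carry over to the result. na.rm = TRUE is forwarded to sum().
res <- aggregate(df[value_cols], by = df[group_cols], FUN = sum, na.rm = TRUE)
print(res)
```

Because no formula is involved here, no na.action is needed: NAs in the value columns reach sum() unchanged and are handled by na.rm.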

Andre Wildberg

I'm not entirely sure what the issue is: setting na.action=NULL means that no NA handling is applied, so all values, including the NAs, are passed to the function untouched. This is what happens by default in the non-formula version.

So I suggest you just omit na.action.

Using mtcars:

mt <- mtcars
mt$mpg[3] <- NA
var_names <- c("mpg", "cyl")

First, the formula variant:

aggregate(
  as.formula(paste(var_names[1], "~", var_names[2])), data= mt,
  na.action= NULL,
  FUN= function(i) sum(i))
#   cyl   mpg
# 1   4    NA
# 2   6 138.2
# 3   8 211.4

Second, the non-formula failure:

aggregate(
  x= mt[ , var_names[1]],  by= list(mt[ , var_names[2]]),
  na.action= NULL,
  FUN= function(i) sum(i))
# Error in FUN(X[[i]], ...) : unused argument (na.action = NULL)
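The error message suggests that na.action was not consumed by aggregate itself but forwarded through ... to FUN, which does not accept it. A quick way to check which methods declare the argument is to inspect their formals; this is only a sketch, and the variable names below are made up for illustration.

```r
# Look up the registered S3 methods without relying on their export status.
formula_method    <- getS3method("aggregate", "formula")
data_frame_method <- getS3method("aggregate", "data.frame")

# Does each method declare an na.action argument?
has_na_action <- c(
  formula    = "na.action" %in% names(formals(formula_method)),
  data.frame = "na.action" %in% names(formals(data_frame_method))
)
print(has_na_action)
```

Only the formula method should report TRUE; in the data-frame method any extra named argument ends up in ... and is handed to FUN.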

Fixing it:

aggregate(
  x= mt[ , var_names[1]],  by= list(mt[ , var_names[2]]),
  # na.action= NULL,
  FUN= function(i) sum(i))
#   Group.1     x
# 1       4    NA
# 2       6 138.2
# 3       8 211.4

Optionally if you want a sum for that first group, then handle it in the function itself:

aggregate(
  x= mt[ , var_names[1]],  by= list(mt[ , var_names[2]]),
  FUN= function(i) sum(i, na.rm=TRUE))
#   Group.1     x
# 1       4 270.5
# 2       6 138.2
# 3       8 211.4
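As a variation on the as.formula(paste(...)) call in the formula variant above, the formula can also be built from the strings with reformulate(). The sketch below assumes the names are syntactically valid column names and uses na.pass as the explicit "keep the NAs" action; f and res are illustrative names only.

```r
mt <- mtcars
mt$mpg[3] <- NA
var_names <- c("mpg", "cyl")

# Build mpg ~ cyl from the character vector.
f <- reformulate(termlabels = var_names[2], response = var_names[1])

# na.pass keeps rows with NAs, so the NA reaches sum() and shows up in the result.
res <- aggregate(f, data = mt, FUN = sum, na.action = na.pass)
print(res)
```

The result should match the formula variant shown earlier, with NA for the group that contains the missing value.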
r2evans
  • Okay, so is there a reason you choose to *not* use the formula version (as demonstrated in my answer)? It is the only `aggregate` method that uses the `na.action=` which you think you must use. – r2evans Dec 14 '21 at 14:46

This code should solve your problem. It drops the rows whose value is NA before aggregating.

# Rows where the value column is not NA.
keep <- !is.na(df[[var_names[1]]])
aggregate(x = df[keep, var_names[1]],
          by = list(df[keep, var_names[2]]),
          FUN = function(i) sum(i))
Haci Duru