Use aggregate and keep NA rows

Question

I have not spent this much time on a single task in years.

There are multiple hints here on SO (for example, here or here), so one is tempted to call this a duplicate (I would even say so myself). But even with those examples and multiple attempts, I was not able to accomplish what I need.

Here is a full example:

x <- data.frame(idx=1:30, group=rep(letters[1:10],3), val=runif(30))

x$val[sample.int(nrow(x), 5)] <- NA; x
spl <- with(x, split(x, group))

lpp <- lapply(spl, 
          function(x) { r <- with(x, 
              data.frame(x, val_g=cut(val, seq(0,1,0.1), labels = FALSE),
                            val_g_lab=cut(val, seq(0,1,0.1)))); r })


rd <- do.call(rbind, lpp); ord <- rd[order(rd$idx, decreasing = FALSE), ]; ord

aggregate(val ~ group + val_g_lab, ord, 
          FUN=function(x) c(mean(x, na.rm = FALSE), 
                            sum(!is.na(x))), na.action=na.pass)

The desired output: I would also like the rows with NA to be included after aggregate(). Currently, aggregate() drops the NA rows.

      idx group        val val_g val_g_lab  
 a.1    1     a 0.53789249     6 (0.5,0.6]          
 b.2    2     b 0.01729695     1   (0,0.1]          
 c.3    3     c 0.62295270     7 (0.6,0.7]          
 d.4    4     d 0.60291892     7 (0.6,0.7]
 e.5    5     e 0.76422909     8 (0.7,0.8]
 f.6    6     f 0.87433547     9 (0.8,0.9]
 g.7    7     g         NA    NA      <NA>          
 h.8    8     h 0.50590159     6 (0.5,0.6]
 i.9    9     i 0.89084068     9 (0.8,0.9]
 ... (continues; the full data set is the ord object)

Anders Ellern Bilgrau · Accepted Answer · 2018-10-09T13:44:18.113

A work-around is simply not to use NA for the value groups. First, initialise your data as above:

x <- data.frame(idx=1:30, group=rep(letters[1:10],3), val=runif(30))

x$val[sample.int(nrow(x), 5)] <- NA; x
spl <- with(x, split(x, group))

lpp <- lapply(spl, 
      function(x) { r <- with(x, 
          data.frame(x, val_g=cut(val, seq(0,1,0.1), labels = FALSE),
                        val_g_lab=cut(val, seq(0,1,0.1)))); r })


rd <- do.call(rbind, lpp)
ord <- rd[order(rd$idx, decreasing = FALSE), ]

Simply convert to character and convert the NAs to some arbitrary string literal:

# Convert to character
ord$val_g_lab <- as.character(ord$val_g_lab)
# Convert NAs
ord$val_g_lab[is.na(ord$val_g_lab)] <- "Unknown"

aggregate(val ~ group + val_g_lab, ord, 
          FUN=function(x) c(mean(x, na.rm = FALSE), sum(!is.na(x))), 
          na.action=na.pass)
#   group val_g_lab      val.1      val.2
#1      e   (0,0.1] 0.02292533 1.00000000
#2      g (0.1,0.2] 0.16078353 1.00000000
#3      g (0.2,0.3] 0.20550228 1.00000000
#4      i (0.2,0.3] 0.26986665 1.00000000
#5      j (0.2,0.3] 0.23176149 1.00000000
#6      d (0.3,0.4] 0.39196441 1.00000000
#7      e (0.3,0.4] 0.39303518 1.00000000
#8      g (0.3,0.4] 0.35646994 1.00000000
#9      i (0.3,0.4] 0.35724889 1.00000000
#10     a (0.4,0.5] 0.48809261 1.00000000
#11     b (0.4,0.5] 0.40993166 1.00000000
#12     d (0.4,0.5] 0.42394859 1.00000000
# ...
#20     b   (0.9,1] 0.99562918 1.00000000
#21     c   (0.9,1] 0.92018049 1.00000000
#22     f   (0.9,1] 0.91379088 1.00000000
#23     h   (0.9,1] 0.93445802 1.00000000
#24     j   (0.9,1] 0.93325098 1.00000000
#25     b   Unknown         NA 0.00000000
#26     c   Unknown         NA 0.00000000
#27     d   Unknown         NA 0.00000000
#28     i   Unknown         NA 0.00000000
#29     j   Unknown         NA 0.00000000

Does this do what you want?
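
As a side note on why the conversion to character comes first: the sketch below uses a two-element toy vector (the names `f`, `f_bad` and `ch` are made up for illustration) and assumes the labels come from cut() as above. Assigning a string that is not an existing level into a factor just produces NA again, whereas a character vector accepts any string.

```r
# Toy bin labels: one real bin and one missing value (illustrative data only).
f <- cut(c(0.05, NA), seq(0, 1, 0.1))

# Assigning a string that is not a level of the factor gives NA again;
# R emits an "invalid factor level" warning, silenced here.
f_bad <- f
suppressWarnings(f_bad[is.na(f_bad)] <- "Unknown")
is.na(f_bad[2])   # still TRUE

# After converting to character, any replacement string is accepted.
ch <- as.character(f)
ch[is.na(ch)] <- "Unknown"
ch                # "(0,0.1]" "Unknown"
```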

Edit:

To answer your question in the comments: note that NaN and NA are not quite the same (see here). Note also that both are very different from "NaN" and "NA", which are string literals (i.e. just text). In any case, NAs are special 'atomic' elements that functions nearly always handle as a special case, so you have to check the documentation for how a particular function handles them. In this case, the na.action argument applies to the values that you aggregate over, not to the 'classes' (the grouping variables) in your formula; aggregate() omits rows with missing values in the grouping variables. The drop=FALSE argument could also be used, but then you get all combinations of the (in this case) two classifications. Redefining the NA as a string literal works because the new name is treated like any other class.
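
A minimal sketch of the points above, on a four-row toy data set rather than the random data from the question. The names `d`, `bin`, `bin_chr` and `count_obs` are made up for illustration; only base R functions are used.

```r
# NA, NaN and their string look-alikes are different things.
is.na(NaN)    # TRUE:  NaN counts as missing
is.nan(NA)    # FALSE: NA is not NaN
is.na("NA")   # FALSE: this is just text

# Toy data: the last row has a missing value, hence a missing bin label.
d <- data.frame(group = c("a", "a", "b", "b"),
                val   = c(0.15, 0.12, 0.55, NA))
d$bin <- cut(d$val, seq(0, 1, 0.1))

# Helper (hypothetical name): number of non-missing observations.
count_obs <- function(v) sum(!is.na(v))

# na.pass keeps NAs in `val`, but the row whose grouping variable `bin`
# is NA is still omitted from the result.
res_drop <- aggregate(val ~ group + bin, d, FUN = count_obs,
                      na.action = na.pass)
nrow(res_drop)   # 2

# Recode the missing label as an ordinary string so it forms its own group.
d$bin_chr <- as.character(d$bin)
d$bin_chr[is.na(d$bin_chr)] <- "Unknown"
res_keep <- aggregate(val ~ group + bin_chr, d, FUN = count_obs,
                      na.action = na.pass)
nrow(res_keep)   # 3, including the (b, "Unknown") row with a count of 0
```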

thanks, this is what's needed. But can you please explain why it does not work with NaN and this have to be replaced? If I rename the `Unknown` to `NaN` again, it works my way. Something with the class type? — Maximilian, Oct 09 '18 at 12:49
(+1) Thanks, well, I'm aware of the difference of `NaN` and `NA` but somehow did not expect that the treatment of these are different within base `R` of `aggregate()` function. I still somehow expected this to be treated as another `level` factor on which I was wrong. Thanks. — Maximilian, Oct 09 '18 at 13:52
