When I aggregate a data frame as shown below, I notice that some values of the grouping column are dropped from the result:

    set.seed(100)
    b <- data.frame(id=sample(1:3, 5, replace=TRUE),
         prop1=sample(c(TRUE,FALSE),5, replace = TRUE),
         prop2= sample(c(TRUE,FALSE,NA), 5, replace= TRUE))

    > b
      id prop1 prop2
    1  3 FALSE  TRUE
    2  1 FALSE    NA
    3  2 FALSE    NA
    4  2 FALSE FALSE
    5  3  TRUE  TRUE
    > aggregate(. ~ id, b, function(x) { length(x[x == TRUE])/length(x)})
      id prop1 prop2
    1  2   0.0     0
    2  3   0.5     1

What happened to id 1 here? Why is it dropped?

smci
user3206440

1 Answer

The help page for `aggregate` documents an `na.action` argument that controls how missing values are treated. For the formula method it defaults to `na.omit`, so any row containing an `NA` is dropped before grouping. After some trials, I found a seed that recreates your issue ;)

    set.seed(3)
    b <- data.frame(id=sample(1:6, 10, replace=TRUE),
                prop1=sample(c(TRUE,FALSE),10, replace = TRUE),
                prop2= sample(c(TRUE,FALSE,NA), 10, replace= TRUE))
    b

       id prop1 prop2
    1   3  TRUE  TRUE
    2   6  TRUE    NA
    3   4 FALSE FALSE
    4   4 FALSE  TRUE
    5   4  TRUE    NA
    6   3  TRUE    NA
    7   2 FALSE FALSE
    8   3  TRUE FALSE
    9   3  TRUE  TRUE
    10  4 FALSE FALSE

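Here id 6 appears only in row 2, where `prop2` is `NA`, so with the default `na.action` that row is removed and id 6 disappears from the result.

Passing `na.action=NULL` keeps those rows: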

    aggregate(. ~ id, b, function(x) { sum(x,na.rm=TRUE)/length(x)}, na.action=NULL)

      id prop1 prop2
    1  2  0.00  0.00
    2  3  1.00  0.50
    3  4  0.25  0.25
    4  6  1.00  0.00
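
The same effect can be seen on a smaller case. The following is a minimal sketch on a hand-made data frame: `d` and `prop_true` are illustrative names, not part of the question's data. It compares the default `na.action` with `na.action = NULL`.

```r
# Hypothetical toy data: id 1 occurs only in a row that contains an NA.
d <- data.frame(id    = c(1, 2, 2),
                prop1 = c(TRUE, FALSE, TRUE),
                prop2 = c(NA, TRUE, FALSE))

# Share of TRUE values, with NA counted in the denominator only.
prop_true <- function(x) sum(x, na.rm = TRUE) / length(x)

# Default (na.omit): the row with the NA is removed before grouping,
# so id 1 is missing from the result.
res_default <- aggregate(. ~ id, d, prop_true)

# na.action = NULL: every row reaches prop_true, so id 1 is kept.
res_keep <- aggregate(. ~ id, d, prop_true, na.action = NULL)

print(res_default)
print(res_keep)
```

With the default, only id 2 should appear. With `na.action = NULL`, both ids should appear.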
Eric Lecoutre
  • By the way, this is the behavior of the formula method of `aggregate`. I found the earlier post where I first learned this, with useful details: http://stackoverflow.com/questions/16844613/na-values-and-r-aggregate-function – Eric Lecoutre Feb 08 '17 at 20:45
  • Why wouldn't `aggregate(. ~ id, b, function(x) { length(x[x == TRUE])/length(x)}, na.action=NULL)` give the same results? – user3206440 Feb 08 '17 at 21:19
  • It depends on what you want as the final result. Look at `x[x==TRUE]` when `x` contains some `NA`: the `NA` entries stay in the subset, so `length` counts them. With `na.action=NULL`, all values are passed to `function(x)`, so ultimately it depends on whether you want to include `NA` in the computations or not (hence my `sum(..., na.rm=TRUE)` to avoid counting them) – Eric Lecoutre Feb 09 '17 at 07:07
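
The point made in the last comment can be checked on a small hand-made vector. The vector `x` and the variable names below are only an illustration and are not taken from the question's data.

```r
x <- c(TRUE, NA, FALSE, TRUE)

# x == TRUE is TRUE NA FALSE TRUE; indexing with an NA yields an NA element,
# so the subset still contains the NA and length() counts it.
subset_true <- x[x == TRUE]
n_subset <- length(subset_true)   # 3: two TRUE values plus one NA

# sum() with na.rm = TRUE counts only the actual TRUE values.
n_true <- sum(x, na.rm = TRUE)    # 2

print(subset_true)
print(c(n_subset = n_subset, n_true = n_true))
```

So on this vector `length(x[x == TRUE])/length(x)` gives 0.75, while `sum(x, na.rm=TRUE)/length(x)` gives 0.5.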