Using aggregate in a dataframe with NA without dropping rows

Question

I am using aggregate to get the means of several variables by a specific category (cy), but there are a few NA's in my dataframe. I am using aggregate rather than ddply because from my understanding it takes care of NA's similarly to using rm.na=TRUE. The problem is that it drops all rows containing NA in the output, so the means are slightly off.

Dataframe:

> bt cy cl pf ne YH YI 1 1 H 1 95 70.0 20 20 2 2 H 1 25 70.0 46 50 3 1 H 1 0 70.0 40 45 4 2 H 1 95 59.9 40 40 5 2 H 1 75 59.9 36 57 6 2 H 1 5 70.0 35 43 7 1 H 1 50 59.9 20 36 8 2 H 1 95 59.9 40 42 9 3 H 1 95 49.5 17 48 10 2 H 1 5 70.0 42 42 11 2 H 1 95 49.5 19 30 12 3 H 1 25 49.5 33 51 13 1 H 1 75 49.5 5 26 14 1 H 1 5 70.0 35 37 15 1 H 1 5 59.9 20 40 16 2 H 1 95 49.5 29 53 17 2 H 1 75 70.0 41 41 18 2 H 1 0 70.0 10 10 19 2 H 1 95 49.5 25 32 20 1 H 1 95 59.9 10 11 21 2 H 1 0 29.5 20 28 22 1 H 1 95 29.5 11 27 23 2 H 1 25 59.9 26 26 24 1 H 1 5 70.0 30 30 25 3 H 1 25 29.5 20 30 26 3 H 1 50 70.0 5 5 27 1 H 1 0 59.9 3 10 28 1 K 1 5 49.5 25 29 29 2 K 1 0 49.5 30 32 30 1 K 1 95 49.5 13 24 31 1 K 1 0 39.5 13 13 32 2 M 1 NA 70.0 45 50 33 3 M 1 25 59.9 3 34'

The full dataframe has 74 rows, and there are NA's peppered throughout all but two columns (cy and cl).

My code looks like this: meancnty<-(aggregate(cbind(pf,ne,YH,YI)~cy, data = newChart, FUN=mean))

I double checked in excel, and the means this function produces are for a dataset of N=69, after removing all rows containing NA's. Is there any way to tell R to ignore the NA's rather than remove the rows, other than taking the mean of each variable by county (I have a lot of variables to summarize by many different categories)?

Thank you

If your problem is with `NA` values, you should probably include a sample of data with some `NA` values in it... — Gregor Thomas, May 10 '17 at 20:31
@Gregor - I just added the whole dataframe with all the NA's — Anthony, May 10 '17 at 20:47
`aggregate(cbind(pf,ne,YH,YI)~cy, data = newChart, FUN=mean, na.rm=TRUE, na.action=na.pass)` - You can remove `NA`s in the mean calculation using the `na.rm=TRUE` argument, while `na.action=na.pass` means that the `NA` values are not deleted prior to the function being run. `na.action=na.omit` is the default for the `aggregate` formula interface. — thelatemail, May 10 '17 at 22:35

score 2 · Accepted Answer · answered May 10 '17 at 21:30

2

using dplyr

df %>%
  group_by(cy) %>%
  summarize_all(mean, na.rm = TRUE)

#      cy       bt        cl       pf       ne       YH       YI
# 1     H 1.785714 0.7209302 53.41463 51.75952 21.92857 29.40476
# 2     K 1.333333 0.8333333 33.33333 47.83333 20.66667 27.33333
# 3     M 1.777778 0.4444444 63.75000 58.68889 24.88889 44.22222
# 4     O 2.062500 0.8750000 31.66667 53.05333 18.06667 30.78571

answered May 10 '17 at 21:30

Andrew Lavers

4,328
1
12
19

it worked! You have rescued me from hours of banging my head against the desk - thank you! – Anthony May 10 '17 at 21:47

Yannis Vassiliadis · Answer 2 · 2017-05-10T21:31:04.577

I think this will work:

meancnty<-(aggregate(with(newChart(cbind(pf,ne,YH,YI), 
           by = list(newchart$cy), FUN=mean, na.rm=T))

I used the following test data:

> q<- data.frame(y = sample(c(0,1), 10, replace=T), a = runif(10, 1, 100), b=runif(10, 20,30))
> q$a[c(2, 5, 7)]<- NA
> q$b[c(1, 3, 4)]<- NA
> q
   y        a        b
1  0 86.87961       NA
2  0       NA 22.39432
3  0 89.38810       NA
4  0 12.96266       NA
5  1       NA 22.07757
6  0 73.96121 24.13154
7  0       NA 22.31431
8  1 62.77095 21.46395
9  0 55.28476 23.14393
10 0 14.01912 28.08305

Using your code from above, I get:

> aggregate(cbind(a,b)~y, data=q, mean, na.rm=T)
  y        a        b
1 0 47.75503 25.11951
2 1 62.77095 21.46395

which is wrong, i.e. it deletes all rows with any NAs and then takes the mean. This however gave the right result:

> aggregate(with(q, cbind(a, b)), by = list(q$y), mean, na.rm=T)
  Group.1        a        b
1       0 55.41591 24.01343
2       1 62.77095 21.77076

It did na.rm=T by column first, and then took the average by group.

Unfortunately, I have no idea why that is, but my guess is that is has to do with the class of y.

Thanks, @Yannis Vassiliadis! This one works as well. I wonder if it's the list function that keeps the rows from being dropped. — Anthony, May 10 '17 at 22:02
@Anthony - the standard version of `aggregate` not using formulas doesn't have an `na.action=na.omit` by default. It has nothing to do with the `list()` usage. The whole thing could be simplified too - `aggregate(q[c("a","b")], q["y"], mean, na.rm=TRUE)` for instance as selecting like `q[c("var")]` will retain `data.frame/list` objects. — thelatemail, May 10 '17 at 22:44

Using aggregate in a dataframe with NA without dropping rows

2 Answers2