Use of na.rm=TRUE and na.action=na.pass in aggregate() - difference?

Question

I couldn't find a question here on Stack Overflow that already answers this, so I'm very sorry if it has already been asked and I just couldn't find it.

All in all, this question is more about understanding what happens to my data depending on which code I use.

So, I have a dataset with a few NAs in it.

I want to aggregate the data and use na.rm=TRUE, which tells R to ignore the NAs while calculating, right? The output I received included NAs, and this led me to use the argument na.action=na.pass together with na.rm=TRUE. This left me with significantly fewer NAs in my output. To be honest, I don't understand why...

As I like to try out and find out for myself, I looked at different variations of my aggregate function:

1. only with na.rm=TRUE
2. only with na.action=na.pass
3. na.rm=TRUE, na.action=na.pass

Using only 2., I get a lot of NAs, which makes sense because I told R to include all NAs in the calculation without having na.rm=TRUE in it. At the same time, 1. and 3. don't give me the same results. Why is that?

I thought that na.rm=TRUE and na.action=na.pass mean the same thing... apparently they don't, because I get slightly different values for my variables' means.

What happens to my data when I use both na.rm=TRUE and na.action=na.pass in an aggregate function, compared to only using na.rm=TRUE? Which one is better to use?

Thank you very much, I appreciate your help!

@Ronak Shah Sadly, I cannot share data, because it's confidential... I just need to know what difference it makes when using compared to just with an aggregate function... but thank you already for your help :) — Arya, Mar 17 '21 at 11:58

score 1 · Answer 1 · answered Mar 18 '21 at 07:18

Let's take a simple example to understand this:

df <- data.frame(a = c(2, 2, 1, 3, NA, NA), b = c(1, 1, 1, 2, 2, 3))
df
#   a b
#1  2 1
#2  2 1
#3  1 1
#4  3 2
#5 NA 2
#6 NA 3

Using aggregate with sum.

aggregate(a~b, df, sum)

#  b a
#1 1 5
#2 2 3

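Notice that there is no b = 3 row in the output. Also, b = 2 has one NA value, but it returned 3 without us adding na.rm = TRUE. This is because the formula method of aggregate uses na.action = na.omit by default, so every row containing an NA is dropped before the groups are formed and before sum is ever called.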
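To make the default behaviour concrete, here is a minimal sketch using the same df. It assumes the default na.action of the formula method is na.omit, and the names default_res, manual_res and narm_res are only illustrative. If that assumption holds, dropping the incomplete rows by hand first, or adding na.rm = TRUE on its own, should not change the result.

```r
df <- data.frame(a = c(2, 2, 1, 3, NA, NA), b = c(1, 1, 1, 2, 2, 3))

# Default formula call: rows with NA are removed before grouping.
default_res <- aggregate(a ~ b, df, sum)

# Removing the incomplete rows by hand first gives the same groups.
manual_res <- aggregate(a ~ b, na.omit(df), sum)

# na.rm = TRUE alone: the NAs are already gone when sum() is called,
# so there is nothing left for it to remove.
narm_res <- aggregate(a ~ b, df, sum, na.rm = TRUE)

# All three should agree: b = 1 -> 5, b = 2 -> 3, and no b = 3 row.
```

This is why variant 1 from the question (only na.rm = TRUE) behaves just like calling aggregate without any NA handling at all.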

With na.action = 'na.pass'.

aggregate(a~b, df, sum, na.action = 'na.pass')

#  b  a
#1 1  5
#2 2 NA
#3 3 NA

By specifying na.action = na.pass we ask it to keep the rows that contain NA values. Hence we now have a row for b = 3, and b = 2 is NA since we did not include na.rm = TRUE.

Using na.rm = TRUE together with na.action = 'na.pass'.

aggregate(a~b, df, sum, na.rm = TRUE, na.action = 'na.pass')

#  b a
#1 1 5
#2 2 3
#3 3 0

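I think the output of this should be self-explanatory: na.pass lets every row through to sum, and na.rm = TRUE then removes the NAs inside each group. So b = 2 gives 3, and b = 3 gives 0 because the sum of no values is 0.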
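The question mentions slightly different means between na.rm = TRUE alone and the combination of both arguments. A likely cause, assuming several variables are aggregated at once, is that na.omit drops a whole row as soon as any variable in it is NA, so valid values in the other columns are lost as well. The sketch below uses a made-up data frame df2 (the names g, x and y are hypothetical, not from the question) to illustrate this.

```r
# Hypothetical data: x and y have their NAs in different rows.
df2 <- data.frame(
  g = c("A", "A", "A", "B", "B"),
  x = c(1, NA, 3, 4, 6),
  y = c(10, 20, NA, 40, 60)
)

# Only na.rm = TRUE: the default na.omit removes rows 2 and 3 entirely,
# so y = 20 and x = 3 are discarded although they are not NA.
# Group A is then computed from row 1 only: x = 1, y = 10.
only_narm <- aggregate(cbind(x, y) ~ g, df2, mean, na.rm = TRUE)

# Both arguments: all rows are kept and mean() skips NAs per column.
# Group A becomes x = mean(1, 3) = 2 and y = mean(10, 20) = 15.
both <- aggregate(cbind(x, y) ~ g, df2, mean,
                  na.rm = TRUE, na.action = na.pass)

# Group B has no NAs, so it is identical in both results (x = 5, y = 50).
```

Which variant is preferable depends on the analysis: the combination uses every available value per column, while the default restricts all columns to the same complete rows.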
