I couldn't find a question here on Stack Overflow that already answers this, so I apologize if it has been asked before and I just couldn't find it.

All in all, this question is more about understanding what happens to my data depending on which code I use.

So, I have a dataset with a few NAs in it.

I want to aggregate the data and use na.rm=TRUE, which tells R to ignore the NAs while calculating, right? The output I received still included NAs, which led me to add the argument na.action=na.pass together with na.rm=TRUE. This left me with significantly fewer NAs in my output. To be honest, I don't understand why...

As I like to try out and find out for myself, I looked at different variations of my aggregate function:

  1. only with na.rm=TRUE
  2. only with na.action=na.pass
  3. na.rm=TRUE, na.action=na.pass

Using only 2., I get a lot of NAs, which makes sense because I told R to include all NAs in the calculation without setting na.rm=TRUE. At the same time, 1. and 3. don't give me the same results. Why is that?

I thought that na.rm=TRUE and na.action=na.pass meant the same thing. Apparently they don't, because I get slightly different values for my variables' means.

What happens to my data when I use both na.rm=TRUE and na.action=na.pass in an aggregate call, compared to using only na.rm=TRUE? Which one should I use?

Thank you very much, I appreciate your help!

Lukas Thaler
Arya
  • Do you have example data to share? – Ronak Shah Mar 17 '21 at 11:49
  • @Ronak Shah Sadly, I cannot share data because it's confidential... I just need to know what difference it makes when using na.rm=TRUE together with na.action=na.pass, compared to just na.rm=TRUE, with an aggregate function... but thank you already for your help :) – Arya Mar 17 '21 at 11:58

1 Answer

Let's take a simple example to understand this:

df <- data.frame(a = c(2, 2, 1, 3, NA, NA), b = c(1, 1, 1, 2, 2, 3))
df
#   a b
#1  2 1
#2  2 1
#3  1 1
#4  3 2
#5 NA 2
#6 NA 3
  • Using aggregate with sum.
aggregate(a~b, df, sum)

#  b a
#1 1 5
#2 2 3

Notice that there is no b = 3 row in the output. Also, b = 2 has one NA value, yet it returned 3 without us adding na.rm = TRUE. This is because the formula method of aggregate defaults to na.action = na.omit, so every row containing an NA is dropped before sum is ever called.
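To make that default visible, here is a small sketch (the object names default_res, explicit_res and manual_res are only for illustration). It spells out na.action = na.omit and, as a second check, removes the incomplete rows by hand with na.omit() before aggregating. All three calls should give the same two rows as above.

```r
df <- data.frame(a = c(2, 2, 1, 3, NA, NA), b = c(1, 1, 1, 2, 2, 3))

# Default behaviour of the formula method
default_res <- aggregate(a ~ b, df, sum)

# The same call with the default na.action written out
explicit_res <- aggregate(a ~ b, df, sum, na.action = na.omit)

# Dropping the incomplete rows by hand before aggregating
manual_res <- aggregate(a ~ b, na.omit(df), sum)

default_res
explicit_res
manual_res
```

Because the rows are removed before grouping, na.rm = TRUE alone has nothing left to remove here: sum never sees an NA.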

  • With na.action = 'na.pass'.
aggregate(a~b, df, sum, na.action = 'na.pass')

#  b  a
#1 1  5
#2 2 NA
#3 3 NA

By specifying na.action = 'na.pass', we ask aggregate to keep the rows that contain NA. Hence we now have a row for b = 3, and both b = 2 and b = 3 are NA because sum returns NA when its input contains an NA and na.rm = TRUE is not set.

  • Using na.rm = TRUE together with na.action = 'na.pass'.
aggregate(a~b, df, sum, na.rm = TRUE, na.action = 'na.pass')

#  b a
#1 1 5
#2 2 3
#3 3 0

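Here na.action = 'na.pass' keeps every row, and na.rm = TRUE is passed on to sum, which removes the NAs within each group. So b = 2 gives 3, and b = 3, which contains only NA, gives 0 (the sum of nothing).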
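The difference between variants 1 and 3 in the question most likely comes from aggregating several columns at once. With the default na.omit, a row is dropped if any of the variables in the formula is NA, so valid values in the other columns are thrown away as well. A minimal sketch with made-up data (df2, g, x and y are hypothetical names, not from the question):

```r
# Made-up data: x is missing in row 2, y is missing in row 3
df2 <- data.frame(g = c("a", "a", "b", "b"),
                  x = c(1, NA, 3, 5),
                  y = c(10, 20, NA, 40))

# Variant 1: only na.rm = TRUE.
# Rows 2 and 3 are dropped entirely before grouping, so y = 20 and
# x = 3 are lost even though they are valid values.
only_rm <- aggregate(cbind(x, y) ~ g, df2, mean, na.rm = TRUE)

# Variant 3: na.rm = TRUE plus na.action = na.pass.
# All rows are kept; mean() removes NAs column by column.
both <- aggregate(cbind(x, y) ~ g, df2, mean,
                  na.rm = TRUE, na.action = na.pass)

only_rm
both
```

In this sketch the group means differ between the two calls because variant 3 uses every non-missing value of each column, which is usually what is wanted when computing per-column means. Note that with mean, a group consisting only of NAs returns NaN rather than 0.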

Ronak Shah