I've just started with R and I've executed these statements:

library(datasets)
head(airquality)
s <- split(airquality,airquality$Month)
sapply(s, function(x) {colMeans(x[,c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)})
lapply(s, function(x) {colMeans(na.omit(x[,c("Ozone", "Solar.R", "Wind")])) }) 

For the sapply, it returns the following:

             5         6          7          8         9
Ozone    23.61538  29.44444  59.115385  59.961538  31.44828
Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
Wind     11.62258  10.26667   8.941935   8.793548  10.18000

And for lapply, it returns the following:

$`5`
    Ozone   Solar.R      Wind 
 24.12500 182.04167  11.50417 

$`6`
    Ozone   Solar.R      Wind 
 29.44444 184.22222  12.17778 

$`7`
     Ozone    Solar.R       Wind 
 59.115385 216.423077   8.523077 

$`8`
    Ozone   Solar.R      Wind 
 60.00000 173.08696   8.86087 

$`9`
    Ozone   Solar.R      Wind 
 31.44828 168.20690  10.07586 

Now, my question is: why are the returned values similar but not the same? Aren't na.rm = TRUE and na.omit supposed to do exactly the same thing, namely omit the missing values and calculate the mean only over the values that we have? In that case, shouldn't I have gotten the same values in both result sets?

Thank you so much for any input!

raluca
  • Your question is not reproducible, because you don't say what `s` is. I assume you created it somehow from the `airquality` dataset. Your question would be much more useful for future visitors if you also included the code that produces `s`. See [here](http://stackoverflow.com/q/5963269/4303162) for more information on reproducible examples. – Stibu Jan 11 '17 at 10:41
  • When trying to understand what's going on, simplify the problem as much as possible: you have two things changing: `lapply/sapply` and `na.rm/na.omit`. Where is the difference coming from? – csgillespie Jan 11 '17 at 10:45
  • @Stibu You're right. I loaded the airquality dataset in the environment and then tried running the statements given by the OP. None of them work for data.frame/matrix. Wondering what form was it converted to, before running these statements....maybe a list? – tushaR Jan 11 '17 at 10:55
  • oh yes, you're absolutely right, I did not provide the working set; but my question I guess was more about the difference na.rm/na.omit. Thank you for pointing that out, though – raluca Jan 11 '17 at 11:28
  • @raluca The point here is that the question is much less useful, if the code in it can not be run. Remember that this is not just about the help that you need, but also about someone else who has a similar problem and finds this question while trying to solve it. It is much harder to understand the question, the problem involved and the solution, if there is no code that can be run. – Stibu Jan 11 '17 at 12:01
  • You are right, and I've edited my post. Thanks again for pointing that out! – raluca Jan 11 '17 at 12:25

2 Answers

They are not supposed to give the same result. Consider this example:

exdf<-data.frame(a=c(1,NA,5),b=c(3,2,2))
#   a b
#1  1 3
#2 NA 2
#3  5 2
colMeans(exdf,na.rm=TRUE)
#       a        b 
#3.000000 2.333333
colMeans(na.omit(exdf))
#  a   b 
#3.0 2.5

Why is this? In the first case, the mean of column b is calculated as (3+2+2)/3, because na.rm = TRUE skips NA values one column at a time. In the second case, na.omit removes the second row in its entirety, including its value of b, which is not NA and is therefore counted in the first case. The mean of b is then just (3+2)/2.
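
To make the row-versus-column distinction concrete, here is a minimal sketch that reuses the `exdf` data frame from above. The variable names `complete`, `per_column` and `row_wise` are only illustrative and not part of the original answer.

```r
exdf <- data.frame(a = c(1, NA, 5), b = c(3, 2, 2))

# na.omit() drops every row that contains at least one NA,
# so row 2 disappears together with its valid value b = 2.
complete <- na.omit(exdf)
nrow(complete)

# na.rm = TRUE instead skips NAs separately within each column.
per_column <- colMeans(exdf, na.rm = TRUE)
row_wise   <- colMeans(complete)

# Column a has the same mean either way; column b does not.
per_column - row_wise
```

The difference is zero for `a` and non-zero for `b`, since only `b` loses a valid observation when the incomplete row is dropped.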

nicola

sapply(s, function(x) {colMeans(x[,c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)}) treats each column individually, and calculates the average of the non-NA values in each column.

lapply(s, function(x) {colMeans(na.omit(x[,c("Ozone", "Solar.R", "Wind")])) }) subsets each element of s to those rows where none of the three columns is NA, and then takes the column means of the resulting data.

The difference comes from those rows which have one or two of the values as NA.
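
One way to see which months are affected is to count, per month, how many rows survive na.omit and how many non-NA values each column contributes under na.rm = TRUE. This is a rough sketch; `cols`, `counts` and `non_na` are names made up for the illustration.

```r
library(datasets)

cols <- c("Ozone", "Solar.R", "Wind")
s <- split(airquality[, cols], airquality$Month)

# Rows per month, and how many of them na.omit() would keep
# (complete.cases() flags rows without any NA).
counts <- sapply(s, function(x) {
  c(total = nrow(x), complete = sum(complete.cases(x)))
})
counts

# Number of non-NA values each column contributes when na.rm = TRUE is used.
non_na <- sapply(s, function(x) colSums(!is.na(x)))
non_na
```

Wherever a column's non-NA count in `non_na` is larger than the `complete` count for that month, the two approaches average over different sets of rows, so the means can differ.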

Miff