If you set `NA` to zero, the effect on your calculations is as if you had measured it and got zero. So if you're measuring temperatures in July, you'll get results as if you had a few frosty days in there, and your average temperature will be lower.
If you set `na.rm = TRUE` or use `complete.cases`, the effect is as if that measurement never happened (which is the case, really). So our average temperature in July would be the average only for the days we did measure.
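To make the difference concrete, here is a small sketch. The `july` vector and the `df` data frame are made-up illustration data (not from any real dataset), with two unmeasured days:

```r
# Hypothetical July temperatures in degrees Celsius; two days were not measured.
july <- c(24, 26, NA, 25, 27, NA, 23)

# Option 1: pretend the missing days were measured at 0 degrees.
as_zero <- july
as_zero[is.na(as_zero)] <- 0
mean_zero <- mean(as_zero)                # 125 / 7, roughly 17.9

# Option 2: drop the missing days from the calculation.
mean_dropped <- mean(july, na.rm = TRUE)  # 125 / 5 = 25

# complete.cases() does the same kind of dropping row-wise on a data frame.
df <- data.frame(day = 1:7, temp = july)
measured <- df[complete.cases(df), ]      # keeps the 5 rows without NA
mean_cc <- mean(measured$temp)            # 25 again
```

The two "frosty days" pull the average down by about 7 degrees, while `na.rm = TRUE` and `complete.cases` agree with each other.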
If you only have a few isolated `NA` values (you can count them with `sum(is.na(x))`), then you might want to set them all to 0 (or some other sensible value; in this example the average temperature in July might be good).
I would only set them to zero if there were vanishingly few (so I don't really care that it's skewing my measurements) or if zero was a sensible value (for example, if we want work experience in months, `NA` might well mean "no experience").
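As a rough sketch of those two replacement choices, again with made-up vectors (`july` and `experience` are hypothetical names and values): count the missing values first, then fill them either with the mean of the observed values or with zero where zero is meaningful.

```r
# Hypothetical July temperatures with two missing days.
july <- c(24, 26, NA, 25, 27, NA, 23)

# How many values are missing, and what share of the data is that?
n_missing <- sum(is.na(july))       # 2
share_missing <- mean(is.na(july))  # 2 / 7

# Fill the gaps with the average of the days we did measure.
imputed <- july
imputed[is.na(imputed)] <- mean(july, na.rm = TRUE)

# Hypothetical work experience in months, assuming NA means "no experience".
experience <- c(12, NA, 36, 0, NA)
experience[is.na(experience)] <- 0
```

Filling with the mean leaves the average unchanged, but it does shrink the spread of the data (the filled-in days sit exactly on the mean), which is another reason to do it only when few values are missing.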
Software is soft: if your dataset is small enough, you can try both and observe how much it affects your results.