How are results of complete.cases() and data[is.na(data)] <- 0 different?

Question

I have a dataframe data and after several computations on it, the final dataframe df.final has some missing values in it.

Before going ahead with further calculations on df.final, am I better off setting all missing values to zero with

data[is.na(data)] <- 0

as mentioned here at How do I replace NA values with zeros in R?, or would doing

df.final <- df.final[complete.cases(df.final), ] # keep only rows without any NA

be more beneficial?

How are the two different?

Those two are completely different. `df[is.na(df)] <- 0` just replaces all NA values in the data with zero, while `df[complete.cases(df), ]` removes all rows in the data that contain at least one NA. The latter is also the same as `na.omit(df)` — Rich Scriven, Oct 03 '15 at 00:17
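To make the comment above concrete, here is a minimal sketch on a made-up three-row data frame (the names `df`, `filled`, and `dropped` are illustrative, not from the question). It only shows the structural difference: replacing keeps every row, while filtering drops rows.

```r
# Hypothetical toy data: rows 2 and 3 each contain one NA
df <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA))

# Option 1: replace every NA with 0; all rows are kept
filled <- df
filled[is.na(filled)] <- 0

# Option 2: keep only rows that have no NA at all
dropped <- df[complete.cases(df), ]

nrow(filled)       # 3 rows, no NA left
nrow(dropped)      # 1 row (only the first row was complete)
nrow(na.omit(df))  # 1 row, the same row as complete.cases() keeps
```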

score 2 · Accepted Answer · answered Oct 03 '15 at 01:18

If you set NA to zero, then the effect on your calculations is as if you measured it and got zero. So if you're measuring temperatures in July, you'll get results as if you had a few frosty days in there. Your average temperature will be lower.

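If you set na.rm = TRUE or use complete.cases, the effect is as if that measurement never happened (which is the case, really). So our average temperature in July would be the average only for the days we did measure. Note that complete.cases drops the whole row, so any other values recorded in that row are discarded too.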
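The temperature example can be sketched with a few invented readings (the vector `july` and its values are assumptions for illustration only):

```r
# Hypothetical July temperatures with two missing days
july <- c(20, 22, NA, 24, NA, 26)

# Treat the missing days as 0 degrees: the average is pulled down
zero_filled <- july
zero_filled[is.na(zero_filled)] <- 0
mean(zero_filled)         # 92 / 6, about 15.3

# Ignore the missing days: average over the four measured days only
mean(july, na.rm = TRUE)  # 92 / 4 = 23
```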

If you only have a few isolated NA values (you can count them with sum(is.na(df.final))), you might want to replace them with 0 or with some other sensible value; in this example, the average temperature in July might be a good choice.
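As a rough sketch of that idea, again on invented readings (`temps` and `imputed` are hypothetical names), you could count the missing values first and then fill them with the mean of the observed ones:

```r
# Hypothetical readings with two missing values
temps <- c(20, 22, NA, 24, NA, 26)

# How many values are missing?
n_missing <- sum(is.na(temps))   # 2

# Fill the gaps with the mean of the observed values instead of 0
imputed <- temps
imputed[is.na(imputed)] <- mean(temps, na.rm = TRUE)

mean(imputed)  # 23, the same as the mean of the observed values
```

Filling with the mean leaves the average unchanged, but it does shrink the spread of the data, so it is only a reasonable shortcut when few values are missing.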

I would only set to zero if there were vanishingly few (so I don't really care that it's skewing my measurements) or if zero was a sensible value (for example, if we want work experience in months, NA might well mean "no experience").

Software is soft: if your dataset is small enough, you can try both and observe how much it affects your data.

thanks for the clear and concise explanation. It gives me a better perspective on how to go ahead. — kRazzy R, Oct 05 '15 at 00:14
