How to perform RMSE with missing values?

Question

I have a dataset with 679 rows and 16 columns in which about 30% of the values are missing. I decided to impute these missing values with the function impute.knn from the package impute, and I got a dataset with 679 rows and 16 columns but without the missing values.

But now I want to check the accuracy using the RMSE and I tried 2 options:

load the package hydroGOF and apply the rmse function
sqrt(mean (obs-sim)^2), na.rm=TRUE)

In both cases I get the error: Error in sim - obs : non-numeric argument to binary operator.

This is happening because the original data set contains an NA value (some values are missing).

How can I calculate the RMSE if I remove the missing values? Then obs and sim will have different sizes.

Your `na.rm=T` is in the wrong function. It's in `sqrt` but needs to be in `mean`. — Señor O, Jul 17 '13 at 15:25
Hi, since you are relatively new here you might want to read the [**about**](http://stackoverflow.com/about) and the [**faq**](http://stackoverflow.com/faq) about how SO works. StackOverflow is made much more valuable to everyone if when you receive an answer that solves your problem, you accept it by clicking the little check mark or upvote a useful answer (which you have *never* done!!). You are under absolutely no obligation to do either, but it is a great way to "give back" to the site if an answer did in fact solve your problem. Thanks! — Simon O'Hanlon, Jul 19 '13 at 22:45

Simon O'Hanlon · Answer 1 · 2013-07-17T15:01:24.377

21

How about simply...

sqrt( sum( (df$model - df$measure)^2 , na.rm = TRUE ) / nrow(df) )

This assumes your data frame is called df. You also have to decide on your N: nrow(df) includes the rows with missing data. Do you want to exclude these from the N observations? I'd guess yes, so instead of nrow(df) you probably want to use sum( !is.na(df$measure) ), or, following @Joshua, just

sqrt( mean( (df$model-df$measure)^2 , na.rm = TRUE ) )
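To make the choice of N concrete, here is a small sketch with a made-up data frame. The values are invented for illustration; only the column names model and measure follow the answer above.

```r
# Hypothetical data: two of the five measurements are missing
df <- data.frame(
  model   = c(1.0, 2.0, 3.0, 4.0, 5.0),
  measure = c(1.5, NA,  2.0, NA,  5.0)
)

# Squared errors; NA wherever measure is NA
sq_err <- (df$model - df$measure)^2

# N = all rows, including the rows that have no measurement
rmse_all_rows <- sqrt(sum(sq_err, na.rm = TRUE) / nrow(df))

# N = only the rows where both values are present
rmse_complete <- sqrt(mean(sq_err, na.rm = TRUE))

# Same as rmse_complete, with the denominator written out explicitly
rmse_explicit <- sqrt(sum(sq_err, na.rm = TRUE) / sum(!is.na(sq_err)))
```

In this example the version that divides by nrow(df) comes out smaller, because the missing rows add to the denominator but contribute nothing to the sum of squared errors.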

edited Jul 17 '13 at 15:01

answered Jul 17 '13 at 14:56


5

or `sqrt(mean((df$model-df$measure)^2,na.rm=TRUE))` – Joshua Ulrich Jul 17 '13 at 14:59
@JoshuaUlrich yeah that would be easier. – Simon O'Hanlon Jul 17 '13 at 15:03
I rephrased the question because the problem is not the test itself. It is the missing values. – Telma_7919 Jul 17 '13 at 15:24
@Telma_7919 the missing values *can't* count because you don't know what the measured value is. So use the second line of code in the answer. It will remove the missing values and tell you how good your model is between observed and expected. – Simon O'Hanlon Jul 17 '13 at 15:31
@Telma_7919, the problem is in how you are treating the missing values. This answer treats them correctly. – Señor O Jul 17 '13 at 15:31

score 10 · Answer 2 · edited May 23 '17 at 11:47

The rmse() function in R package hydroGOF has an NA-remove parameter:

# require(hydroGOF)
rmse(sim, obs, na.rm=TRUE, ...)

which, according to the documentation, does what you would expect when na.rm is TRUE:

"When an ’NA’ value is found at the i-th position in obs OR sim, the i-th value of obs AND sim are removed before the computation."

Without a minimal reproducible example, it's hard to say why that didn't work for you.
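One thing worth checking, as a guess only: in R, "non-numeric argument to binary operator" is typically raised when an operand of `-` is not numeric (for example a character column), which a numeric NA alone would not cause. The sketch below uses an invented data frame with hypothetical columns a and b to show how to spot and convert such a column.

```r
# Hypothetical data frame in which column b was read in as text
df.obs <- data.frame(
  a = c(1.2, NA, 3.4),
  b = c("5.6", NA, "7.8"),
  stringsAsFactors = FALSE
)

# Arithmetic on the text column fails; capture the error instead of stopping
err <- tryCatch(df.obs$b - 1, error = function(e) conditionMessage(e))

# Find out which columns are actually numeric
is_num <- sapply(df.obs, is.numeric)

# Convert the non-numeric columns; missing entries stay NA
df.fixed <- df.obs
df.fixed[!is_num] <- lapply(df.obs[!is_num], as.numeric)

# Arithmetic now works and simply propagates the NA
diffs <- df.fixed$b - 1
```

If every column already reports TRUE in is_num, the cause lies elsewhere and this check rules out one possibility.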

If you want to eliminate the missing values before you input to the hydroGOF::rmse() function, you could do:

# Row names of df.obs where the column of interest was actually observed
keep <- rownames(df.obs)[!is.na(df.obs$col_with_missing_data)]
my.rmse <- rmse(df.sim[keep, ], df.obs[keep, ])

assuming you have the "simulated" (imputed) and "observed" (original) data sets in different data frames named df.sim and df.obs, respectively, that were created from the same original data frame so have the same dimensions and row names.
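The removal rule quoted from the documentation can also be sketched in base R on plain vectors. This is an illustration of the idea, not hydroGOF's actual implementation, and the obs and sim values are invented.

```r
# Hypothetical vectors; the NA positions differ between the two
obs <- c(2.0, NA,  4.0, 6.0, 8.0)
sim <- c(2.5, 3.0, NA,  6.0, 7.0)

# Keep position i only if both obs[i] and sim[i] are present
keep <- !is.na(obs) & !is.na(sim)

# RMSE over the remaining pairs
rmse_pairwise <- sqrt(mean((sim[keep] - obs[keep])^2))

# mean(..., na.rm = TRUE) gives the same result, because the difference
# is NA whenever either side is NA
rmse_na_rm <- sqrt(mean((sim - obs)^2, na.rm = TRUE))
```

Both forms leave obs and sim the same length after filtering, since a position is dropped from both vectors at once.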

Here is a canonical way to do the same thing if you have more than one column with missing data:

rows.wout.missing.values <- with(df.obs, rownames(df.obs[!is.na(col_with_missing_data1) & !is.na(col_with_missing_data2) & !is.na(col_with_missing_data3),]))
my.rmse <- rmse(df.sim[rows.wout.missing.values,], df.obs[rows.wout.missing.values,])
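As an alternative sketch, base R's complete.cases() can build the same row filter without chaining !is.na() calls. The data frames, row names, and column names x, y, z below are invented, and the RMSE is computed per column in base R rather than with hydroGOF.

```r
# Hypothetical observed data with NAs scattered over three columns
df.obs <- data.frame(
  x = c(1, NA, 3, 4, 5),
  y = c(NA, 2, 3, 4, 5),
  z = c(1, 2, 3, NA, 5),
  row.names = c("r1", "r2", "r3", "r4", "r5")
)

# Hypothetical imputed data with the same dimensions and row names
df.sim <- data.frame(
  x = c(1, 2, 3.5, 4, 5),
  y = c(1, 2, 3, 4, 6),
  z = c(1, 2, 3, 4, 5),
  row.names = c("r1", "r2", "r3", "r4", "r5")
)

cols <- c("x", "y", "z")

# Row names where none of the listed columns is missing
rows.complete <- rownames(df.obs)[complete.cases(df.obs[, cols])]

# Column-wise RMSE over those rows only
sq.err <- (df.sim[rows.complete, cols] - df.obs[rows.complete, cols])^2
rmse.by.col <- sqrt(colMeans(sq.err))
```

Selecting by row name rather than by position keeps the two data frames aligned even if one of them has been reordered.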

Note that my original answer used `dplyr`. I've since removed it since `dplyr::filter()` does _not_ retain the original rownames. You could still come up with a solution to use `dplyr` if you save off the original rownames as another column in the dataframe. — c.gutierrez, Oct 23 '14 at 23:02
