11

I have a dataset with 679 rows and 16 columns in which 30% of the values are missing. I decided to impute these missing values with the function impute.knn from the package impute, and I got a dataset with 679 rows and 16 columns but without the missing values.

But now I want to check the accuracy using the RMSE and I tried 2 options:

  1. load the package hydroGOF and apply the rmse function
  2. sqrt(mean (obs-sim)^2), na.rm=TRUE)

In both cases I get the error: Error in sim - obs : non-numeric argument to binary operator.

This is happening because the original dataset contains NA values (some values are missing).

How can I calculate the RMSE if I remove the missing values? Then obs and sim will have different sizes.

Tung
Telma_7919
  • Ia, Sorry. I rephrased the question too. – Telma_7919 Jul 17 '13 at 15:23
  • 3
    Your `na.rm=T` is in the wrong function. It's in `sqrt` but needs to be in `mean`. – Señor O Jul 17 '13 at 15:25
  • Hi, since you are relatively new here you might want to read the [**about**](http://stackoverflow.com/about) and the [**faq**](http://stackoverflow.com/faq) about how SO works. StackOverflow is made much more valuable to everyone if when you receive an answer that solves your problem, you accept it by clicking the little check mark or upvote a useful answer (which you have *never* done!!). You are under absolutely no obligation to do either, but it is a great way to "give back" to the site if an answer did in fact solve your problem. Thanks! – Simon O'Hanlon Jul 19 '13 at 22:45

2 Answers

21

How about simply...

sqrt( sum( (df$model - df$measure)^2 , na.rm = TRUE ) / nrow(df) )

This assumes your data frame is called df. You also have to decide on your N: nrow(df) includes the two rows with missing data. Do you want to exclude these from the N observations? I'd guess yes, so instead of nrow(df) you probably want to use sum( !is.na(df$measure) ), or, following @Joshua, just

sqrt( mean( (df$model-df$measure)^2 , na.rm = TRUE ) )
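To make the choice of N concrete, here is a minimal sketch on toy data. The values are made up for illustration, and the column names model and measure simply follow the naming used above.

```r
# Toy data (hypothetical values): one measurement is missing.
df <- data.frame(
  model   = c(1, 2, 3, 4),
  measure = c(1, NA, 5, 4)
)

sq_err <- (df$model - df$measure)^2   # squared errors; NA where measure is NA

# na.rm belongs to mean(), not sqrt(); the NA pair is dropped from N.
rmse_mean <- sqrt(mean(sq_err, na.rm = TRUE))

# Same result with an explicit N that counts only complete pairs.
n_complete <- sum(!is.na(sq_err))
rmse_sum <- sqrt(sum(sq_err, na.rm = TRUE) / n_complete)

# Dividing by nrow(df) keeps the NA row in N and gives a smaller value.
rmse_nrow <- sqrt(sum(sq_err, na.rm = TRUE) / nrow(df))
```

With this data the first two versions agree, while the nrow(df) version is smaller because the row with the missing value still counts toward N.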
Simon O'Hanlon
10

The rmse() function in R package hydroGOF has an NA-remove parameter:

# require(hydroGOF)
rmse(sim, obs, na.rm=TRUE, ...)

which, according to the documentation, does what you would expect when na.rm is TRUE:

"When an ’NA’ value is found at the i-th position in obs OR sim, the i-th value of obs AND sim are removed before the computation."

Without a minimal reproducible example, it's hard to say why that didn't work for you.
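For reference, the pairwise removal that the documentation describes can be sketched in base R. The helper rmse_pairwise below is hypothetical and is not part of hydroGOF; it only mirrors the quoted behaviour on plain numeric vectors.

```r
# Hypothetical helper, not part of hydroGOF: RMSE with pairwise NA removal.
rmse_pairwise <- function(sim, obs) {
  # Fail early on non-numeric input or mismatched lengths.
  stopifnot(is.numeric(sim), is.numeric(obs), length(sim) == length(obs))
  keep <- !is.na(sim) & !is.na(obs)   # drop position i if either value is NA
  sqrt(mean((sim[keep] - obs[keep])^2))
}

obs <- c(2, NA, 4, 6)   # toy "observed" values with one gap
sim <- c(1, 3, 4, NA)   # toy "simulated" values with one gap

result <- rmse_pairwise(sim, obs)   # only positions 1 and 3 are used
```

Note that subtracting vectors containing NA does not raise an error in R; it just yields NA. A "non-numeric argument to binary operator" message more likely points to a character or factor input, which the stopifnot() check in this sketch would catch.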

If you want to eliminate the missing values before you input to the hydroGOF::rmse() function, you could do:

my.rmse <- rmse(df.sim[rownames(df.obs[!is.na(df.obs$col_with_missing_data), ]), ],
                df.obs[!is.na(df.obs$col_with_missing_data), ])

assuming you have the "simulated" (imputed) and "observed" (original) data sets in different data frames named df.sim and df.obs, respectively, that were created from the same original data frame so have the same dimensions and row names.

Here is a canonical way to do the same thing if you have more than one column with missing data:

rows.wout.missing.values <- with(df.obs, rownames(df.obs[
    !is.na(col_with_missing_data1) &
    !is.na(col_with_missing_data2) &
    !is.na(col_with_missing_data3), ]))
my.rmse <- rmse(df.sim[rows.wout.missing.values, ], df.obs[rows.wout.missing.values, ])
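As an alternative to chaining !is.na() calls, base R's complete.cases() can select the rows in one step. This is a sketch on toy data, not the approach shown above: the column names a and b and all values are made up, and the RMSE is computed per column in base R rather than with hydroGOF.

```r
# Toy data frames (hypothetical values) with matching rows.
df.obs <- data.frame(a = c(1, NA, 3, 4), b = c(2, 2, NA, 5))
df.sim <- data.frame(a = c(1, 2, 3, 5),  b = c(2, 3, 4, 5))

# complete.cases() is TRUE for rows with no NA in the chosen columns.
keep <- complete.cases(df.obs[, c("a", "b")])

# Per-column RMSE over the retained rows (rows 1 and 4 here).
col_rmse <- sapply(names(df.obs), function(col) {
  sqrt(mean((df.sim[keep, col] - df.obs[keep, col])^2))
})
```

Because keep is a logical vector indexed by position, this variant does not depend on row names, only on df.sim and df.obs having their rows in the same order.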
Community
c.gutierrez
  • Note that my original answer used `dplyr`. I've since removed it since `dplyr::filter()` does _not_ retain the original rownames. You could still come up with a solution to use `dplyr` if you save off the original rownames as another column in the dataframe. – c.gutierrez Oct 23 '14 at 23:02