I have two Pandas DataFrames, with mostly different data:
err_df =
2 3 11 13 14 16
4 122.153000 56.3023 21.2722 71.79590 81.63212 NaN
8 70.967800 19.5768 69.9780 21.11050 116.89777 NaN
12 70.659100 19.5768 NaN 39.46288 70.62480 70.597850
16 19.237067 NaN NaN 18.93980 18.60660 19.104767
20 19.349440 NaN NaN 19.38080 NaN 36.785533
24 NaN NaN NaN 17.92060 NaN NaN
temp_df =
2 3 11 13 14 16
4 89.5488 122.153 121.957 122.153 122.153 NaN
8 89.5488 122.153 121.957 122.153 122.153 NaN
12 89.5488 122.153 NaN 122.153 122.153 122.153
16 89.5488 NaN NaN 122.153 122.153 122.153
20 89.5488 NaN NaN 122.153 NaN 122.153
24 NaN NaN NaN 122.153 NaN NaN
I want to calculate the Root Mean Squared Error (RMSE) between the columns of both the DataFrames and store the results in a 3rd DataFrame. I know how to calculate the RMSE for an individual column, let's say 2
:
print(((err_df[2] - temp_df[2])**2).mean()**0.5)
result = 48.2427158719
There's no trouble with the NaN
characters either - they're ignored, which is a relief because I think using sklearn
's mean_square_error
function gives this error ValueError: Array contains NaN or infinity.
Basically, I want to be able to calculate the RMSE values "dynamically" and not having to change the columns each time I run the main program.
The 3rd DataFrame that holds the results should look something like this:
df3 =
2 3 11 13 14 16
0 48.2427158719 "RMSE" "RMSE" "RMSE" "RMSE" "RMSE"
How do I achieve this?
Any help is appreciated. Thanks in advance :)
(Using Ubuntu 14.04 32-Bit VM and Python 2.7)