I've trained a model to predict a certain variable. When I use this model to predict that value and compare the predictions to the actual values, I get the following two distributions.

[Image: plot of the actual and predicted distributions]

The corresponding R Data Frame looks as follows:

x_var | kind
3.532 | actual
4.676 | actual
...
3.12 | predicted
6.78 | predicted

These two distributions obviously have slightly different means, quantiles, etc. What I would now like to do is combine these two distributions into one (especially as they are fairly similar), but not like in the following thread.

Instead, I would like to plot one density function that shows the difference between the actual and predicted values and enables me to say e.g. 50% of the predictions are within -X% and +Y% of the actual values.

I've tried plotting the difference predicted - actual, and also the difference from the mean of the respective group. However, neither approach has produced the desired result. It is especially important that the plotted distribution lets me make the above statement, i.e. 50% of the predictions are within -X% and +Y% of the actual values. How can this be achieved?

halfer
koVex

3 Answers


Assume the two distributions are stored in the data frames df_actual and df_predicted. Then calculate:

# dataframe with difference between two distributions
df_diff <- data.frame(x = df_predicted$x - df_actual$x, y = df_predicted$y - df_actual$y)

Then find the relative difference in percent:

# mean relative difference in percent; df_diff already holds predicted - actual
x_diff <- mean(df_diff$x / df_actual$x) * 100
y_diff <- mean(df_diff$y / df_actual$y) * 100

This gives you the percentage by which the predictions deviate from the actual values, positive or negative, in x as well as in y. That is my suggestion; also see this thread on displaying and measuring the area between two distribution curves.

I hope this helps.

Community
parth
  • I don't understand how that is supposed to work. If the above calculations are made, `df_diff` will just have two columns x and y that contain the same values... – koVex May 19 '17 at 12:23
  • actually, `df_diff` will contain difference between actual and predicted data points ie. `df_predicted$x - df_actual$x` and so on.. – parth May 19 '17 at 12:30
  • Yes, but since I can only subtract `actual` from `predicted` once, `x` and `y` will contain the same values. E.g. I take `3.823` (my first predicted value), from which I subtract `3.637` (my first actual value) and therefore get 0.186. My DF then looks like x | y 0.186 | 0.186 0.285 | 0.285 – koVex May 19 '17 at 12:48
  • oh got your point , i was assuming that your dataset has different `x` and `y` – parth May 22 '17 at 05:12

ParthChaudhary is right - rather than subtracting the distributions, you want to analyze the distribution of differences. But take care to subtract the values within corresponding pairs, or otherwise the actual - predicted differences will be overshadowed by the variance of actual (and predicted) alone. I.e., if you have something like:

x y type
0 10.9 actual
1 15.7 actual
2 25.3 actual
...
0 10 predicted
1 17 predicted
2 23 predicted
...

you would merge(df[df$type=="actual",], df[df$type=="predicted",], by="x"), then calculate and plot y.x-y.y.
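
The merge-and-subtract step above can be sketched as follows. This is only an illustration under assumptions: the toy values are made up, a key column x is assumed to pair each prediction with its actual value, and the names paired, rel_err and bounds are hypothetical. The 25% and 75% quantiles of the signed relative error give the -X% and +Y% for a statement such as "50% of the predictions are within -X% and +Y% of the actual values".

```r
# Toy data in the long layout shown above (values are made up).
df <- data.frame(
  x    = rep(0:4, times = 2),
  y    = c(10.9, 15.7, 25.3, 31.0, 42.2,   # actual
           10.0, 17.0, 23.0, 33.0, 40.0),  # predicted
  type = rep(c("actual", "predicted"), each = 5)
)

# Pair each prediction with its actual value via the shared key x.
paired <- merge(df[df$type == "actual", ], df[df$type == "predicted", ],
                by = "x", suffixes = c(".actual", ".predicted"))

# Signed relative error in percent; negative means under-prediction.
paired$rel_err <- (paired$y.predicted - paired$y.actual) / paired$y.actual * 100

# The central 50% of the predictions lie between these two quantiles.
bounds <- quantile(paired$rel_err, probs = c(0.25, 0.75))
print(bounds)

# One density curve of the relative errors, with the bounds marked.
plot(density(paired$rel_err), main = "Relative prediction error (%)")
abline(v = bounds, lty = 2)
```

Other coverage levels work the same way, e.g. probs = c(0.05, 0.95) for the central 90%. If the data have no key column, the pairing has to come from somewhere else, such as the row order in which the predictions were generated.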

juod

To better quantify whether the differences between your predicted and actual distributions are significant, you could consider using the Kolmogorov-Smirnov test in R, available via the function ks.test.
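
A minimal sketch of such a test, assuming the actual and predicted values are available as two numeric vectors (the simulated data below are made up). Note that the two-sample test compares the two overall distributions and ignores which prediction belongs to which actual value, so it complements rather than replaces an analysis of the paired differences.

```r
set.seed(1)  # reproducible made-up data
actual    <- rnorm(200, mean = 4, sd = 1)
predicted <- actual + rnorm(200, mean = 0.1, sd = 0.3)

# Two-sample Kolmogorov-Smirnov test.
# Null hypothesis: both samples come from the same continuous distribution.
res <- ks.test(actual, predicted)

print(res$statistic)  # D: largest gap between the two empirical CDFs
print(res$p.value)    # a small p-value suggests the distributions differ
```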

swtlk