How to plot the difference between two density distributions

Question

I've trained a model to predict a certain variable. When I use this model to predict the value and compare these predictions to the actual values, I get the following two distributions.

The corresponding R Data Frame looks as follows:

x_var | kind
3.532 | actual
4.676 | actual
...
3.12 | predicted
6.78 | predicted

These two distributions obviously have slightly different means, quantiles, etc. What I would now like to do is combine these two distributions into one (especially as they are fairly similar), but not like in the following thread.

Instead, I would like to plot one density function that shows the difference between the actual and predicted values and enables me to say e.g. 50% of the predictions are within -X% and +Y% of the actual values.

I've tried plotting the difference predicted - actual, and also the difference from the mean of the respective group. However, neither approach produced the desired result. It is especially important that the plotted distribution lets me make the above statement, i.e. 50% of the predictions are within -X% and +Y% of the actual values. How can this be achieved?

score 0 · Answer 1 · edited Jun 20 '20 at 09:12

Treat the two distributions as data frames df_actual and df_predicted, then calculate:

# dataframe with difference between two distributions
df_diff <- data.frame(x = df_predicted$x - df_actual$x, y = df_predicted$y - df_actual$y)

Then find the mean relative difference in percent:

# df_diff already holds predicted - actual, so divide it by the actual values
x_diff <- mean(df_diff$x / df_actual$x) * 100
y_diff <- mean(df_diff$y / df_actual$y) * 100

This gives the average percentage by which the predictions lie above (+) or below (-) the actual values, in x as well as in y. That is my suggestion; also see this thread on displaying and measuring the area between two distribution curves.
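As a minimal sketch of how per-prediction relative differences could be turned into the statement the question asks for: this assumes the actual and predicted values are paired row by row, and the vectors `actual` and `predicted` below are made-up illustration data, not the asker's.

```r
# Hypothetical paired data: predicted[i] is the prediction for actual[i]
actual    <- c(3.5, 4.7, 5.1, 6.2, 4.0, 5.6, 3.9, 4.4)
predicted <- c(3.1, 4.9, 5.6, 6.0, 4.3, 5.2, 4.1, 4.5)

# Relative error of each prediction, in percent of the actual value
rel_diff <- (predicted - actual) / actual * 100

# The central 50% of the relative errors lie between the 25% and 75% quantiles
bounds <- quantile(rel_diff, probs = c(0.25, 0.75))
print(bounds)

# Density of the relative errors, with the 50% interval marked
plot(density(rel_diff), main = "Relative prediction error (%)")
abline(v = bounds, lty = 2)
```

If the lower bound is negative and the upper bound positive, the two values in `bounds` can be read as the -X% and +Y% of the question. Other coverage levels follow by changing `probs`, e.g. `c(0.05, 0.95)` for 90%.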

I hope this helps.

answered May 19 '17 at 10:55 by parth · edited Jun 20 '20 at 09:12 by Community

I don't understand how that is supposed to work. If the above calculations are made, `df_diff` will just have two columns x and y that contain the same values... – koVex May 19 '17 at 12:23
actually, `df_diff` will contain difference between actual and predicted data points ie. `df_predicted$x - df_actual$x` and so on.. – parth May 19 '17 at 12:30
Yes, but since I can only subtract `actual` from `predicted` once, `x` and `y` will contain the same values. E.g. I take `3.823` (my first predicted value), from which I subtract `3.637` (my first actual value) and therefore get 0.186. My DF then looks like x | y 0.186 | 0.186 0.285 | 0.285 – koVex May 19 '17 at 12:48
oh got your point , i was assuming that your dataset has different `x` and `y` – parth May 22 '17 at 05:12

score 0 · Answer 2 · answered May 20 '17 at 14:27

ParthChaudhary is right - rather than subtracting the distributions, you want to analyze the distribution of differences. But take care to subtract the values within corresponding pairs, or otherwise the actual - predicted differences will be overshadowed by the variance of actual (and predicted) alone. I.e., if you have something like:

x y type
0 10.9 actual
1 15.7 actual
2 25.3 actual
...
0 10 predicted
1 17 predicted
2 23 predicted
...

you would merge(df[df$type=="actual",], df[df$type=="predicted",], by="x"), then calculate and plot y.x-y.y.
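The pairing step described above might look like the following sketch. The data frame `df`, its values, and the `suffixes` names are assumptions chosen for illustration; with the default suffixes the merged columns would be called `y.x` and `y.y` instead.

```r
# Hypothetical long-format data with a key column x identifying each pair
df <- data.frame(
  x    = c(0, 1, 2, 3, 0, 1, 2, 3),
  y    = c(10.9, 15.7, 25.3, 31.2, 10, 17, 23, 30),
  type = c(rep("actual", 4), rep("predicted", 4))
)

# Pair each actual value with its prediction via the shared key x;
# explicit suffixes replace the default .x / .y column names
paired <- merge(df[df$type == "actual", ], df[df$type == "predicted", ],
                by = "x", suffixes = c(".actual", ".predicted"))

# Difference within each pair (actual - predicted, i.e. y.x - y.y above)
paired$diff <- paired$y.actual - paired$y.predicted

# Distribution of the paired differences
plot(density(paired$diff), main = "Actual - predicted")
```

Merging on the key ensures each prediction is compared with its own actual value rather than with whichever row happens to sit at the same position.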

score 0 · Answer 3 · answered Aug 22 '18 at 17:34

To better quantify whether the differences between your predicted and actual distributions are significant, you could consider using the Kolmogorov-Smirnov test in R, available via the function ks.test.
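A short sketch of such a test follows. The two samples are simulated purely for illustration, and their means and standard deviations are arbitrary assumptions.

```r
set.seed(1)  # made-up samples for illustration only
actual    <- rnorm(200, mean = 5.0, sd = 1.0)
predicted <- rnorm(200, mean = 5.2, sd = 0.9)

# Two-sample KS test: the null hypothesis is that both samples
# come from the same continuous distribution
result <- ks.test(actual, predicted)

print(result$statistic)  # largest gap between the two empirical CDFs
print(result$p.value)
```

Note that the two-sample test treats the samples as independent, so it compares the overall shapes of the two distributions and does not use the pairing between each prediction and its actual value.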

answered Aug 22 '18 at 17:34 by swtlk
