
I'm looking to find the distance between the points and the prediction line. Ideally I would like the results to be displayed in a new column which contains the distance, called 'Distance'.

My Imports:

import os.path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
%matplotlib inline 

Sample of my data:

idx  Exam Results  Hours Studied
0       93          8.232795
1       94          7.879095
2       92          6.972698
3       88          6.854017
4       91          6.043066
5       87          5.510013
6       89          5.509297

My code so far:

x = df['Hours Studied'].values[:,np.newaxis]
y = df['Exam Results'].values

model = LinearRegression()
model.fit(x, y)

plt.scatter(x, y,color='r')
plt.plot(x, model.predict(x),color='k')
plt.show()

My plot: [image: scatter of the data points with the fitted regression line]

Any help would be greatly appreciated. Thanks

Mark Kennedy
  • Check this answer https://stackoverflow.com/questions/39840030/distance-between-point-and-a-line-from-two-points#39840218 – alec_djinn Apr 16 '18 at 14:30

1 Answer


You simply need to assign the difference between y and model.predict(x) to a new column (or take the absolute value if you only want the magnitude of the difference):

#df["Distance"] = abs(y - model.predict(x))  # if you only want magnitude
df["Distance"] = y - model.predict(x)
print(df)
#   Exam Results  Hours Studied  Distance
#0            93       8.232795 -0.478739
#1            94       7.879095  1.198511
#2            92       6.972698  0.934043
#3            88       6.854017 -2.838712
#4            91       6.043066  1.714063
#5            87       5.510013 -1.265269
#6            89       5.509297  0.736102

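This works because your model predicts a y (dependent variable) for each value of the independent variable (x). Each point and its prediction share the same x coordinate, so the difference in y is the vertical distance (the residual) between the point and the line, which is the value you want.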
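
As a self-contained sketch of the above, the snippet below rebuilds the seven sample rows from the question and adds the 'Distance' column. The 'Perpendicular' column is not part of the original answer; it is an assumption about what you might want if "distance" means the shortest (perpendicular) distance to the line rather than the vertical one. For a line y = m*x + b that distance is |residual| / sqrt(1 + m^2).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# The seven sample rows shown in the question
df = pd.DataFrame({
    "Exam Results": [93, 94, 92, 88, 91, 87, 89],
    "Hours Studied": [8.232795, 7.879095, 6.972698, 6.854017,
                      6.043066, 5.510013, 5.509297],
})

x = df[["Hours Studied"]].to_numpy()  # 2-D array of shape (n, 1), as scikit-learn expects
y = df["Exam Results"].to_numpy()     # 1-D array of shape (n,)

model = LinearRegression().fit(x, y)

# Vertical distance (residual): actual y minus predicted y at the same x
df["Distance"] = y - model.predict(x)

# Hypothetical extra column: perpendicular distance to the fitted line
m = model.coef_[0]
df["Perpendicular"] = df["Distance"].abs() / np.sqrt(1 + m**2)

print(df)
```

The signed residual tells you whether a point lies above (positive) or below (negative) the line. The perpendicular distance is never larger than the absolute residual.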

pault
  • I seem to be getting this error when I try to run that line of code: `ValueError: Length of values does not match length of index`. Any idea as to why this is? The shape of x is (132, 1), and the shape of y is (132,). – Mark Kennedy Apr 16 '18 at 15:25
  • What's the length of your dataframe? That error indicates that the problem is coming from the `df["Distance"] = ` part rather than the `y - model.predict(x)` part. You could also do `df["Distance"] = df['Exam Results'].values - model.predict(df['Hours Studied'].values[:,np.newaxis])`. – pault Apr 16 '18 at 15:28
  • The length of my dataframe is 1789, but that last piece of code seems to do the trick, thanks very much. Any idea as to why I encountered this problem? – Mark Kennedy Apr 16 '18 at 15:34
  • The error was because you were trying to assign 132 values to a dataframe with 1789 rows. I suspect that you built your model only on a subset of the data and were trying to then calculate the `Distance` for every row. – pault Apr 16 '18 at 15:44
  • I know this is an old post, but can anyone help with how to do this for each group after a groupby? – moys Jul 14 '19 at 09:14
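
A rough sketch of the two situations raised in the comments, using synthetic data rather than the asker's real dataframe. The "Class" grouping column, the 12-row training subset and the "Group Distance" column name are all hypothetical. The first part fits on a subset but predicts from the dataframe's own columns, so the result has one value per row. The second part is one possible way to get per-group distances by fitting a separate line for each group.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data; "Class" is a hypothetical grouping column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Class": np.repeat(["A", "B"], 10),
    "Hours Studied": rng.uniform(4, 9, size=20),
})
df["Exam Results"] = 70 + 3 * df["Hours Studied"] + rng.normal(0, 2, size=20)

# Fit on a subset only (assumed here to be the first 12 rows)
train = df.iloc[:12]
model = LinearRegression().fit(
    train[["Hours Studied"]].to_numpy(), train["Exam Results"].to_numpy()
)

# Predict from the full dataframe's columns so the lengths match the index
df["Distance"] = (
    df["Exam Results"].to_numpy()
    - model.predict(df[["Hours Studied"]].to_numpy())
)

# Per-group variant: fit one line per group and store that group's residuals
df["Group Distance"] = np.nan
for _, g in df.groupby("Class"):
    gx = g[["Hours Studied"]].to_numpy()
    gy = g["Exam Results"].to_numpy()
    fitted = LinearRegression().fit(gx, gy)
    df.loc[g.index, "Group Distance"] = gy - fitted.predict(gx)

print(df.head())
```

Assigning `y - model.predict(x)` directly would fail here for the same reason as in the comments if `x` and `y` came from the 12-row subset while `df` has 20 rows.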