I drew a QQ plot for my multiple regression and got the graph below. Can someone tell me why there are two points under the red line? And do these points have an effect on my model?

[QQ plot of the residuals]

I used the code below to draw the graph.

from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import matplotlib.pyplot as plt

# fit the regression model
reg = LinearRegression()
reg.fit(x_train, y_train)

# residuals on the test set
pred_reg_GS = reg.predict(x_test)
diff = y_test - pred_reg_GS

# QQ plot of the residuals against the 45-degree line
sm.qqplot(diff, fit=True, line='45')
plt.show()

1 Answer


Take a look at Understanding Q-Q Plots for a concise description of what a QQ plot is. In your case, this particular part is important:

If both sets of quantiles came from the same distribution, we should see the points forming a line that’s roughly straight.

This theoretical one-to-one relationship is illustrated explicitly in your plot using the red line.

And regarding your question...

do these points have an effect on my model?

... one or both of the points that fall far from the red line could be considered outliers. This means that whatever model you've built here does not capture the properties of those two observations. Since what we're looking at is a QQ plot of the residuals from a regression model, you should take a closer look at those two observations. What is it about these two that makes them stand out from the rest of your sample? One way to "catch" these outliers is often to represent them with one or two dummy variables, as sketched below.
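As a first step, you can locate the two observations directly from the residuals you already computed. This is a minimal sketch, assuming `diff` from your snippet is a pandas Series (all names come from your code):

# the two observations with the largest absolute residuals
outlier_idx = diff.abs().nlargest(2).index

# inspect the suspect observations
print(diff.loc[outlier_idx])
print(x_test.loc[outlier_idx])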


Edit 1: Basic approach for outliers and dummy variables


Since you haven't explicitly tagged your question with sklearn, I'm taking the liberty of illustrating this using statsmodels. And in lieu of a sample of your data, I'll just use the built-in iris dataset, where the last part of what we'll use looks like this:

[dataframe preview: last rows of the setosa subset]

1. Linear regression of sepal_length on sepal_width

Plot 1:

[QQ plot 1: points fall roughly on a straight line]

Looks good! Nothing wrong here. But let's mix it up a bit by adding some extreme values to the dataset. You'll find a complete code snippet at the end.
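For reference, the fit behind Plot 1 is just the plain regression without any dummy. A minimal sketch, consistent with the complete code at the end:

import statsmodels.api as sm
import seaborn as sns
from matplotlib import pyplot as plt

# setosa subset of the iris sample data
df = sns.load_dataset('iris')
df = df[df['species'] == 'setosa']

# step 1: regress sepal_length on sepal_width, no dummy
mod_base = sm.OLS(df['sepal_length'], df[['sepal_width']]).fit()
sm.qqplot(mod_base.resid)
plt.show()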

2. Introduce an outlier

Now, let's add a row to the dataframe where `sepal_width = 8` instead of `3`. This will give you the following QQ plot with a very clear outlier:

[QQ plot 2: one point far below the line]
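For reference, the row behind this plot is appended like this (a sketch; the complete code at the end does the same):

# append one extreme observation: sepal_width = 8 instead of roughly 3
# (columns: sepal_length, sepal_width, petal_length, petal_width, species, outlier_dummy)
df['outlier_dummy'] = 0
df.loc[len(df)] = [5, 8, 1.4, 0.3, 'setosa', 1]

# refit the step-1 model with the outlier included (dummy not used yet)
mod_out = sm.OLS(df['sepal_length'], df[['sepal_width']]).fit()
sm.qqplot(mod_out.resid)
plt.show()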

And here's a part of the model summary:

===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
sepal_width     1.8690      0.033     57.246      0.000       1.804       1.934
==============================================================================
Omnibus:                       18.144   Durbin-Watson:                   0.427
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                7.909
Skew:                          -0.338   Prob(JB):                       0.0192
Kurtosis:                       2.101   Cond. No.                         1.00
==============================================================================

So why is this an outlier? Because we messed with the dataset. The reason for the outliers in your dataset is impossible for me to determine. In our made-up example, the reasons for a setosa iris to have a sepal width of 8 could be many. Maybe the scientist labeled it wrong? Maybe it isn't a setosa at all? Or maybe it has been genetically modified?

Now, instead of just discarding this observation from the sample, it's usually more informative to keep it where it is, accept that there is something special about this observation, and illustrate exactly that by including a dummy variable that is 1 for that observation and 0 for all others. Now the last part of your dataframe should look like this:

[dataframe preview: last rows, now with an outlier_dummy column]
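If the outlier is already a row in your own dataframe rather than one you appended, flagging it looks like this (`i_out` is a hypothetical index for the suspect observation):

# hypothetical: i_out is the index of the suspect row
df['outlier_dummy'] = 0             # 0 for every observation...
df.loc[i_out, 'outlier_dummy'] = 1  # ...except the outlier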

3. Identify the outlier using a dummy variable

Now, your qqplot will look like this:

[QQ plot 3: residuals after including the dummy]

And here's your model summary:

=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
sepal_width       1.4512      0.015     94.613      0.000       1.420       1.482
outlier_dummy    -6.6097      0.394    -16.791      0.000      -7.401      -5.819
==============================================================================
Omnibus:                        1.917   Durbin-Watson:                   2.188
Prob(Omnibus):                  0.383   Jarque-Bera (JB):                1.066
Skew:                           0.218   Prob(JB):                        0.587
Kurtosis:                       3.558   Cond. No.                         27.0
==============================================================================

Notice that the inclusion of a dummy variable changes the coefficient estimate for sepal_width, and also the values for skewness and kurtosis. And that's the short version of the effects an outlier will have on your model.
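You can see this side by side by fitting both specifications on the dataframe from the complete code below and comparing the estimates:

# compare coefficient estimates with and without the dummy
mod_no_dummy = sm.OLS(df['sepal_length'], df[['sepal_width']]).fit()
mod_dummy = sm.OLS(df['sepal_length'], df[['sepal_width', 'outlier_dummy']]).fit()

print(mod_no_dummy.params)  # sepal_width around 1.87 with the outlier unmodeled
print(mod_dummy.params)     # sepal_width around 1.45 once the dummy absorbs the outlier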

Complete code:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from matplotlib import pyplot as plt
import seaborn as sns

# sample data
df = pd.DataFrame(sns.load_dataset('iris'))

# subset of sample data
df = df[df['species'] == 'setosa']

# add column for dummy variable
df['outlier_dummy'] = 0

# append a row with an extreme value for sepal_width,
# as well as a dummy variable = 1 for that row
df.loc[len(df)] = [5, 8, 1.4, 0.3, 'setosa', 1]

# define independent variables
x = ['sepal_width', 'outlier_dummy']

# run regression
mod_fit = sm.OLS(df['sepal_length'], df[x]).fit()
res = mod_fit.resid

fig = sm.qqplot(res)
plt.show()
print(mod_fit.summary())
  • Thanks vestland. I'm not clear about your last sentence: "One way to 'catch' these outliers is often to represent them with one or two dummy variables." Can you explain a little more about this? It would be very helpful for me. – randunu galhena Jan 03 '20 at 17:48
  • @randunugalhena I'll try to get back to you on Monday. In the meantime, it would be great if you could include your code and a data sample in the question! – vestland Jan 03 '20 at 19:18
  • Thanks @vestland. I'm sorry I couldn't reply to your message earlier. I've put my code in the question. – randunu galhena Jan 06 '20 at 04:42
  • @randunugalhena Do you have a data sample? If you're using pandas dataframes you can just run, for example, x_train.head(10).to_dict() and add that dictionary to the question. – vestland Jan 07 '20 at 09:53
  • Thanks for your help. I ran x_train.head(10).to_dict(), but my sample is too big, so it can't be uploaded in the question section. Do you have an idea of how I can upload it? – randunu galhena Jan 09 '20 at 09:55
  • I appreciate your answer. I have some questions about your explanation regarding my problem. 1) How can I identify my two outlier observations, since my dataset is very large? 2) If I identify the 2 outliers and then completely remove them instead of including dummy variables, is that OK? My final dataset has more than 220 variables. And what does it mean when we add those dummy variables to the final model (how do I interpret those variables)? – randunu galhena Jan 10 '20 at 03:04
  • @randunugalhena **1:** This is a huge topic in machine learning, so you should really invest some time to dive deep into it. But in short: think of what you're doing when you remove outliers: **you are actively removing information to make your model fit the real world better.** You can only justify this in the hopefully very few cases where you have a real suspicion that something is wrong with that particular observation. – vestland Jan 10 '20 at 10:20
  • @randunugalhena **2:** Take a look at my contribution to the question [how can I detect the peak points (outliers) from my pandas DataFrame](https://stackoverflow.com/questions/51006163/how-can-i-detect-the-peak-points-outliers-from-my-pandas-dataframe/51018875#51018875) – vestland Jan 10 '20 at 10:22
  • @randunugalhena **3:** If you're experiencing a lot of outliers, then your model just does not describe the real-world relationship between your variables very well. So if outliers are a huge problem, you should consider other models, a different dataset, or both. – vestland Jan 10 '20 at 10:26
  • @randunugalhena **4:** But if you are willing to identify your outliers using dummy variables, the interpretation of those variables will (on paper) be clear by comparing the model outputs with and without those dummy variables, just like I did in my answer to your question. – vestland Jan 10 '20 at 10:28
  • @randunugalhena **5:** So in the example above, the "effect" of a unit change in `X` is described by the coefficient: `1.86` units in `Y`. But you know you have a problem with outliers here. In other words: your model does not "catch" all the variation in `Y`. If you include a dummy variable, your estimate of X on Y changes to `1.45`. It also turns out that your dummy variable is significant, with a p-value of 0.000 (and a t-value of -16.791), and so should be included in your model. – vestland Jan 10 '20 at 10:36
  • @randunugalhena Happy to help! And please take a look at the answers to [how can I detect the peak points (outliers) from my pandas DataFrame](https://stackoverflow.com/questions/51006163/how-can-i-detect-the-peak-points-outliers-from-my-pandas-dataframe/51018875#51018875). If you are determined to remove outliers above some level for all columns of your dataset, you can do so with the function I put together there. – vestland Jan 10 '20 at 12:26