How to remove outliers from data set using Cook's distance?

Question

We are required to remove outliers/influential points from the data set in a model. I have 400 observations and 5 explanatory variables.

I have tried this:

Outlier <- as.numeric(names(cooksdistance)[cooksdistance > 4 / sample_size])

Where `cooksdistance` is the calculated Cook's distance for the model.

The problem is that this doesn’t give me the actual outliers.

Welcome to SO! This community has a few [rules](https://stackoverflow.com/help/on-topic) and [norms](https://stackoverflow.com/help/how-to-ask) and following them will help you get a good answer to your question. In particular, it’s best to provide an [MCVE](https://stackoverflow.com/help/mcve) (a minimal, complete, and verifiable example). Good advice for R-specific MCVEs is available [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) and [here](https://reprex.tidyverse.org/articles/reprex-dos-and-donts.html). — DanY, Sep 13 '18 at 20:44
I think you probably mean to remove outliers from your *data*, not from your *model*. This means you will need to reference your data frame in your code... It will be much easier to help if you follow some of Dan's advice and create a reproducible example. — Gregor Thomas, Sep 13 '18 at 20:53
'R for Data Science' is an excellent resource. http://r4ds.had.co.nz/exploratory-data-analysis.html. Scroll down to 7.3.3 Unusual Values. Also, google removing outliers in r... there are many, many results. — djchapman, Sep 13 '18 at 20:54
Cook's distance gives you the leverage points, which are not necessarily outliers. Also, as others have mentioned, you want to remove outliers before model fitting, if at all. — user2974951, Sep 14 '18 at 06:27

score 0 · Answer 1 · edited Jun 20 '20 at 09:12

In the formula you used to select influential observations, the condition should be as follows: an observation whose Cook's distance is more than 4 times the mean Cook's distance can be considered influential (potentially an outlier).

Cook's distance, or Cook's D, is a commonly used estimate of the influence of a data point when performing a least-squares regression analysis.

In a practical ordinary least squares analysis, Cook's distance can be used in several ways: to indicate influential data points that are particularly worth checking for validity; or to indicate regions of the design space where it would be good to be able to obtain more data points.
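To make the definition concrete, Cook's distance for an ordinary least-squares fit can be rebuilt from each observation's residual and leverage. The following is a minimal sketch, assuming the built-in `mtcars` data and an arbitrary model `mpg ~ wt + hp` (neither comes from the question); it compares a hand computation with `cooks.distance()`.

```r
# Sketch: reproduce cooks.distance() by hand for an lm fit.
# mtcars and the formula are illustrative assumptions, not the asker's data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

e  <- residuals(fit)                # raw residuals
h  <- hatvalues(fit)                # leverage: diagonal of the hat matrix
p  <- length(coef(fit))             # number of coefficients, incl. intercept
s2 <- sum(e^2) / df.residual(fit)   # residual variance estimate

# D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
cooks_manual <- (e^2 / (p * s2)) * h / (1 - h)^2

# Compare with the built-in implementation
all.equal(cooks_manual, cooks.distance(fit))
```

The formula shows why a point needs both a sizeable residual and non-trivial leverage to get a large Cook's distance.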

In general use, observations whose Cook's distance is greater than 4 times the mean may be classified as influential. This is not a hard boundary.
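Since the cutoff is only a convention, it can help to see how the `4 / n` rule from the question and the `4 * mean` rule compare on the same fit. This is a rough sketch, again assuming the built-in `mtcars` data and an arbitrary model; the variable names are illustrative.

```r
# Sketch: flag observations under two common rule-of-thumb cutoffs.
# mtcars and the formula are assumptions made for illustration.
fit <- lm(mpg ~ wt + hp, data = mtcars)
cd  <- cooks.distance(fit)
n   <- nobs(fit)                    # observations actually used in the fit

flag_4n    <- names(cd)[cd > 4 / n]                        # rule from the question
flag_4mean <- names(cd)[cd > 4 * mean(cd, na.rm = TRUE)]   # rule from this answer

flag_4n
flag_4mean
```

Note that `mtcars` has character row names, so `as.numeric(names(cd))` would return `NA` here. Converting names to numbers only works when the data frame has default integer row names; indexing by the names themselves is safer.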

As an example, here is how to identify influential observations in the ozone data set:

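# Load the data and fit a linear model on all predictors
ozone <- read.csv("http://rstatistics.net/wp-content/uploads/2015/09/ozone.csv")
m <- lm(ozone_reading ~ ., data = ozone)
cooksdistance <- cooks.distance(m)

# Row numbers of observations whose Cook's distance exceeds 4 times the mean
influential <- as.numeric(names(cooksdistance)[cooksdistance > 4 * mean(cooksdistance, na.rm = TRUE)])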

ozone[influential, ]
#     Month Day_of_month Day_of_week ozone_reading pressure_height Wind_speed Humidity Temperature_Sandburg Temperature_ElMonte
# 19      1           19           1          4.07            5680          5       73                   52               56.48
# 23      1           23           5          4.90            5700          5       59                   69               51.08
# 58      2           27           5         22.89            5740          3       47                   53               58.82
# 133     5           12           3         33.04            5880          3       80                   80               73.04
# 135     5           14           5         31.15            5850          4       76                   78               71.24
# 149     5           28           5          4.82            5750          3       76                   65               51.08
# 243     8           30           1         37.98            5950          5       62                   92               82.40
# 273     9           29           3          4.60            5640          5       93                   63               54.32
# 286    10           12           2          7.00            5830          8       77                   71               67.10
#     Inversion_base_height Pressure_gradient Inversion_temperature Visibility
# 19                    393               -68                 69.80         10
# 23                   3044                18                 52.88        150
# 58                    885                -4                 67.10         80
# 133                   436                 0                 86.36         40
# 135                  1181                50                 79.88         17
# 149                  3644                86                 59.36         70
# 243                   557                 0                 90.68         70
# 273                  5000                30                 52.70         70
# 286                   337               -17                 81.14         20

Interpretation:

Rows 58, 133 and 135 have very high ozone_reading.

Rows 23, 149 and 273 have very high Inversion_base_height.

Row 19 has very low Pressure_gradient.
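Once the influential rows are identified, they still have to be dropped from the data frame and the model refitted. Below is a minimal sketch of that step. `remove_influential()` is a hypothetical helper written for this answer, not an existing R function, and it is demonstrated on the built-in `mtcars` data so it runs without a download.

```r
# Hypothetical helper: drop rows whose Cook's distance exceeds
# `multiplier` times the mean Cook's distance of the fitted model.
remove_influential <- function(data, model, multiplier = 4) {
  cd <- cooks.distance(model)
  flagged <- names(cd)[!is.na(cd) & cd > multiplier * mean(cd, na.rm = TRUE)]
  # Match on row names instead of using negative positions:
  # data[-integer(0), ] would return zero rows if nothing were flagged.
  data[!(rownames(data) %in% flagged), , drop = FALSE]
}

# Illustrative data and model (assumptions, not the asker's data)
fit   <- lm(mpg ~ wt + hp, data = mtcars)
clean <- remove_influential(mtcars, fit)
refit <- lm(mpg ~ wt + hp, data = clean)   # refit on the reduced data

nrow(mtcars) - nrow(clean)                 # number of rows dropped
```

For the ozone example this would be called as `ozone_clean <- remove_influential(ozone, m)`, which gives the same result as `ozone[-influential, ]` whenever at least one row is flagged. Whether dropping those rows is justified is a separate judgment call; Cook's distance only marks them as worth inspecting.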
