2

What is the preferred way (in your opinion) to perform multivariate robust outlier detection in R in an automatic way, i.e. without manual inspection and plotting?

I have found the "dprep" package, but it seems discontinued. However, as outlier detection is a frequent and important task, a generic default method should be available, e.g. the MCD estimator (Rousseeuw and Van Driesen, 1999).

A.M.
  • 323
  • 2
  • 11
  • "what's the *preferred* way" still justifies the SO question, but `library("sos"); findFn("mcd estimator outlier")` would find a bunch of options including the `covMcd` function referred to by @Dieter Menne – Ben Bolker Dec 04 '11 at 15:19
  • 1
    This seems to be better suited to the stats.stackexchange site, IMO. – Hong Ooi Dec 04 '11 at 16:39

2 Answers2

2

Try covMcd in package robustbase.

Dieter Menne
  • 10,076
  • 44
  • 67
0

Use Cook's Distance enter image description here You could use cook's distance. Cook's distance is computed based on a linear regression model. That means, you will be able to include multiple X variables to compute the outlier (high influence observations, more precisely). This effectively gives you the flexibility to add or drop the variables on which you would want to determine the outliers. The way to compute it for every observation in R would look something like this:

mod <- lm(Y ~ X1 + X2 + X3, data=inputData)
cooksd <- cooks.distance(mod)

In general convention, those observations with a cook's distance > 4*mean(cooks distance) are considered outliers. For more information about the formula and interpretation of cook's distance refer to this example

Disclaimer: I am the author.

Selva
  • 2,045
  • 1
  • 23
  • 18