Multivariate Outlier Detection using R with probability

Question

I have been searching everywhere for the best method to identify the multivariate outliers using R but I don't think I have found any believable approach yet.

We can take the iris data as an example as my data also contains multiple fields

data(iris)
df <- iris[, 1:4] #only taking the four numeric fields

Firstly, I am using Mahalanobis distance from the library MVN

library(MVN)
result <- mvOutlier(df, qqplot = TRUE, method = "quan") #non-adjusted
result <- mvOutlier(df, qqplot = TRUE, method = "adj.quan") #adjusted Mahalonobis distance

Both resulted in a large number of outliers (50 out of 150 for non-adjusted and 49/150 for adjusted), which I think needs more refinement. I unfortunately can't seem to find a variable in the mvOutlier method to set the threshold (says increasing the probability of a point being an outlier, so that we have a smaller number)

Secondly, I used outliers library. This is to find univariate outliers. So that, my plan is to find the outliers on each dimension of the data and those points being outliers on all the dimensions are regarded as outliers of the dataset.

library(outliers)
result <- scores(df, type="t", prob=0.95) #t test, probability is 0.95
result <- subset(result, result$Sepal.Length == T & result$Sepal.Width == T & result$Petal.Length == T & result$Petal.Width == T)

For this we can set the probability, but I don't think it can replace the multivariate outlier detection.

Some other approaches that I tried

library(mvoutlier): this only shows the plot. It is hard to automatically find outliers. And I don't know how to add the probability into this
cook's distance (link): a man said that he used cook's distance but I don't think there is any strong academic proof to prove that this is ok.

Andrew Haynes · Answer 1 · 2017-01-04T11:30:30.000

5

I'll leave you with these two links, the first is a paper on different methods for multivariate outlier detection, while the second one is looking at how to implement these in R.

Cook's Distance is a valid way of looking at the influence a datapoint has, and as such help detect outlying points. Mahalanobis Distance is also used regularly.

For your test example, the iris dataset is not useful. It is used for classification problems as it is clearly separable. Your exclusion of 50 data points would be getting rid of an entire species.

Outlier Detection in Multivariate Data-

http://www.m-hikari.com/ams/ams-2015/ams-45-48-2015/13manojAMS45-48-2015-96.pdf

R implementation

http://r-statistics.co/Outlier-Treatment-With-R.html

edited Jan 04 '17 at 11:30

answered Jan 04 '17 at 11:21

Andrew Haynes

2,612
2
20
35

Hi Andrew, thanks for your comment. Cook's distance seems good but I don't know what to put into the Y when you have to make a fit line lm(Y~., data) when all the data fields are equivalently independent. For Mahalanobis Distance, I don't think I have seen any R implementation. Also, the paper didn't say anything about increasing/decreasing the threshold. I am wondering what happens if I reduce the threshold to 3 time mean of cook's distance for outliers. Lastly, do you happen to know how to test the confidence/accuracy of these methods? – Duy Bui Jan 04 '17 at 12:29

score 1 · Answer 2 · answered Apr 15 '21 at 04:54

There are very interesting alternatives.

The first one, the Rlof package, which calculates the Local Outlier Factor. It calculates a score (called local outlier factor) that reflects the degree of anomaly of the observations. It measures the deviation of the local density of a point with respect to its neighbors. The idea is to detect samples that have a substantially lower density than their neighbors. In practice, the local density is obtained from the k nearest neighbors.

Second, the solitude package.

It applies a Random Forest inspired method called Isolation Forest.

Both generate a different score, which is not a probability, but allows determining the threshold from which to assign the number of anomalous data that the type of problem and the thematic knowledge deems appropriate.

Multivariate Outlier Detection using R with probability

2 Answers2