
I am working on finding outliers using Mahalanobis distance in R. I have a dataset with 30 rows and 24 columns, which I feed into the `mahalanobis` function from the stats package. I want to find the distance of each row vector from the rest of the rows. The results look good until I export the same input data and the same code to another machine and rerun it, which gives different results from those seen on machine 1. Is this expected behaviour, or am I missing something? Please advise.

Code I used:

```r
m_dist <- mahalanobis(data[, 2:25], colMeans(data[, 2:25]), cov(data[, 2:25]), tol = 1e-20)
```

Then I used a boxplot on m_dist to identify the outliers. The result on the first machine doesn't match the one on the second. I even used set.seed(1007) on both machines just to check, but the results are still different.
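
For reference, here is a minimal sketch of the full workflow, with simulated data standing in for my real dataset (the column layout and the use of `boxplot.stats()` to pull out the points beyond the whiskers are assumptions for illustration only):

```r
## Minimal sketch with simulated data in place of the real dataset
## (assumes column 1 is an ID column and columns 2:25 hold the 24 numeric variables)
set.seed(1007)
data <- as.data.frame(matrix(rnorm(30 * 25), nrow = 30))

x      <- data[, 2:25]
m_dist <- mahalanobis(x, colMeans(x), cov(x), tol = 1e-20)

## Treat the points that boxplot() would draw beyond the whiskers as outliers
out_values   <- boxplot.stats(m_dist)$out
outlier_rows <- which(m_dist %in% out_values)
outlier_rows
```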

I found another thread that discusses a similar result difference in Python, but it doesn't help me in any way...

    `mahalanobis` uses `solve` internally. Different systems (including different results from `La_library()`) can give different results. – Roland Oct 22 '18 at 08:33
  • Also, I would not set the tolerance lower than the default without looking into the underlying maths in more detail. – Roland Oct 22 '18 at 08:36
  • OK... is there any resolution for this? Is there a way I can get the same result? – Varun kadekar Oct 22 '18 at 09:07
  • I suspect you can get the same result if you increase the tolerance and ensure that the same LAPACK implementation is used. – Roland Oct 22 '18 at 09:20
  • I can confirm your findings. For me, Microsoft R Open, which uses Intel MKL as its linear algebra foundation, produced slightly different results compared with stock R from the R project. – Severin Pappadeux Oct 22 '18 at 15:23
  • OK, thanks. Then wouldn't it be unsafe to productionize this approach? It's risky to put code in production that can give different results on different machines. What's your opinion? – Varun kadekar Oct 23 '18 at 08:39
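
Following up on Roland's suggestions in the comments, a sketch of the checks one could run on both machines. The idea of comparing the linear-algebra backends and retrying with the default tolerance comes from the comments; the exact calls shown are an assumption and depend on the R version in use:

```r
## Compare the linear algebra backend on both machines; if the BLAS/LAPACK
## libraries differ, small numerical differences from solve() are expected.
sessionInfo()   # recent R versions report the BLAS/LAPACK paths here
La_library()    # path of the LAPACK library in use (newer R versions)
La_version()    # LAPACK version string

## Retry with solve()'s default tolerance; if the covariance matrix is
## nearly singular, this should error instead of silently "inverting" it.
x <- data[, 2:25]
m_dist_default <- mahalanobis(x, colMeans(x), cov(x))
```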

0 Answers