I have a data frame with 18 columns and about 12,000 rows. I want to find outliers in the first 17 columns and compare the results with column 18. Column 18 is a factor that can be used as an indicator of whether a row is an outlier.

My data frame is ufo, and I drop column 18 as follows:

ufo2 <- ufo[,1:17]

and then convert the 3 non-numeric columns to numeric values:

ufo2$Weight <- as.numeric(ufo2$Weight)
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue)
ufo2$Score <- as.numeric(ufo2$Score)

and then use the following command for outlier detection:

outlier.scores <- lofactor(ufo2, k=5)

But all elements of outlier.scores are NA.

Is there a mistake in this code?

Is there another way to find outliers in such a data frame?

All of my code:

setwd(datadirectory)
library(doMC)
registerDoMC(cores=8)

library(DMwR)

# load data
load("data_9802-f2.RData")

ufo2 <- ufo[,2:17]

ufo2$Weight <- as.numeric(ufo2$Weight)
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue)
ufo2$Score <- as.numeric(ufo2$Score)

outlier.scores <- lofactor(ufo2, k=5)

The output of dput(head(ufo2)) is:

structure(list(Origin = c(2L, 2L, 2L, 2L, 2L, 2L), IO = c(2L, 
2L, 2L, 2L, 2L, 2L), Lot = c(1003L, 1003L, 1003L, 1012L, 1012L, 
1013L), DocNumber = c(10069L, 10069L, 10087L, 10355L, 10355L, 
10382L), OperatorID = c(5698L, 5698L, 2015L, 246L, 246L, 4135L
), Month = c(1L, 1L, 1L, 1L, 1L, 1L), LineNo = c(1L, 2L, 1L, 
1L, 2L, 1L), Country = c(1L, 1L, 1L, 1L, 11L, 1L), ProduceCode = c(63456227L, 
63455714L, 33687427L, 32686627L, 32686627L, 791614L), Weight = c(900, 
850, 483, 110000, 5900, 1000), InvoiceValue = c(637, 775, 2896, 
48812, 1459, 77), InvoiceValueWeight = c(707L, 912L, 5995L, 444L, 
247L, 77L), AvgWeightMonth = c(1194.53, 1175.53, 7607.17, 311.667, 
311.667, 363.526), SDWeightMonth = c(864.931, 780.247, 3442.93, 
93.5818, 93.5818, 326.238), Score = c(0.56366535234262, 0.33775439984787, 
0.46825476121676, 1.414092583904, 0.69101737288291, 0.87827342721894
), TransactionNo = c(47L, 47L, 6L, 3L, 3L, 57L)), .Names = c("Origin", 
"IO", "Lot", "DocNumber", "OperatorID", "Month", "LineNo", "Country", 
"ProduceCode", "Weight", "InvoiceValue", "InvoiceValueWeight", 
"AvgWeightMonth", "SDWeightMonth", "Score", "TransactionNo"), row.names = c(NA, 
6L), class = "data.frame")
Mohammad
  • Hi and welcome to stackoverflow! You are much more likely to receive an answer if you provide a [minimal, reproducible data set](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) together with the code you have tried. Thanks! – Henrik Oct 02 '13 at 15:22
  • Thanks! The dataset is quite large, but my code is as above. What more do you need to answer my question? – Mohammad Oct 02 '13 at 15:26
  • A sample of your data (provided using `dput(head(ufo2))`) and which packages you have loaded. Just a guess: Have you looked at your data after using `as.numeric`? – Roland Oct 02 '13 at 15:30
  • You don't need to post your entire data set. Please read the link I gave you and you will find various ways to post a small sample data set. Cheers. – Henrik Oct 02 '13 at 15:30
  • The output of the dput command has been added to the question. Regarding the data after as.numeric, everything looks OK. – Mohammad Oct 02 '13 at 15:31
  • As Roland wrote: Please also add to the top of your script the package(s) necessary to run your script (e.g. `library(the-name-of-the-package)`) – Henrik Oct 02 '13 at 15:42
  • I can't reproduce this. Running `lofactor(ufo2,k=5)` returns `[1] 0.9276643 0.9276669 1.0788669 1.0839490 1.0839502 0.9276643` – mrip Oct 02 '13 at 16:34
  • Without appropriate preprocessing (including dropping irrelevant features such as DocNumber, OperatorID which make Euclidean distance meaningless) the results will be essentially random. – Has QUIT--Anony-Mousse Oct 04 '13 at 12:17
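
Roland's question about `as.numeric` can be checked with a few lines of base R. The sketch below uses made-up toy values, not the asker's data, and assumes the original columns might be factors or contain unparseable entries. Whether missing values are what produces the NA scores here is an assumption; the source does not confirm the cause.

```r
# Toy stand-in for one of the converted columns (values are made up)
weight_factor <- factor(c("900", "850", "110000", "n/a"))

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(weight_factor)

# Converting via character recovers the numbers; unparseable entries become NA
values <- suppressWarnings(as.numeric(as.character(weight_factor)))

print(codes)
print(values)

# Count missing values per column of a (toy) data frame before scoring it
toy <- data.frame(Weight = values, Score = c(0.56, 0.34, NA, 0.88))
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```

If `colSums(is.na(ufo2))` reports any missing values on the real data, those rows are worth inspecting before calling `lofactor`.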

1 Answer


First of all, you need to spend a lot more time preprocessing your data. Your columns have completely different meanings and scales. Without careful preprocessing, the outlier detection results will be meaningless, because they are based on a meaningless distance.
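
As a minimal sketch of one common preprocessing step, the code below standardizes each column to zero mean and unit variance with base R's `scale()`, so that no single column dominates the Euclidean distance. The toy values are loosely based on the sample in the question, and choosing plain standardization (rather than, say, a log transform) is an assumption, not a recommendation specific to this data set.

```r
# Toy data with columns on very different scales (illustrative values)
toy <- data.frame(
  Weight       = c(900, 850, 483, 110000, 5900, 1000),
  InvoiceValue = c(637, 775, 2896, 48812, 1459, 77),
  Score        = c(0.56, 0.34, 0.47, 1.41, 0.69, 0.88)
)

# Standardize every column: subtract the mean, divide by the standard deviation
scaled <- as.data.frame(scale(toy))

# After scaling, every column has mean ~0 and standard deviation 1
print(round(colMeans(scaled), 10))
print(apply(scaled, 2, sd))

# The scaled frame could then be scored, e.g. (requires DMwR):
# outlier.scores <- lofactor(scaled, k = 5)
```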

For example, take ProduceCode. Are you sure this should be part of your similarity measure?
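
A rough sketch of dropping such columns before computing distances follows. The column names come from the question's `dput` output, but which of them count as identifiers (the `id_columns` vector) is an assumption made for illustration; only someone who knows the data can decide that.

```r
# Toy frame reusing a few column names from the question (values illustrative)
toy <- data.frame(
  DocNumber    = c(10069L, 10069L, 10087L),
  OperatorID   = c(5698L, 5698L, 2015L),
  ProduceCode  = c(63456227L, 63455714L, 33687427L),
  Weight       = c(900, 850, 483),
  InvoiceValue = c(637, 775, 2896)
)

# Hypothetical list of identifier-like columns with no meaningful numeric distance
id_columns <- c("DocNumber", "OperatorID", "ProduceCode")

# Keep only the remaining columns as features
features <- toy[, setdiff(names(toy), id_columns), drop = FALSE]
print(names(features))
```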

Also note that I found the lofactor implementation of the R DMwR package to be really slow. Plus, it seems to be hard-wired to Euclidean distance!

Instead, I recommend using ELKI for outlier detection. First, it comes with a much wider choice of algorithms; second, it is much faster than R; and third, it is very modular and flexible. For your use case, you may need to implement a custom distance function instead of using Euclidean distance.

Here's the link to the ELKI tutorial on implementing a custom distance function.

Has QUIT--Anony-Mousse