-3

Hello dear boys and girls, I apologize if this question is in the wrong place (that is, the wrong forum: Stack Overflow, etc.).

I can use Python and R at a semi-intermediate level. I have been wondering for a while about the topic of this question:

  1. If I have a data set that I can build a statistical model on, then all is well. I build the model, test it, test it again, make a scorecard, and poof.
  2. I want to know: is there a way (theoretically or even practically) to detect irregularities/outliers in data without a previous data set that you can, for example, build a statistical model on? I mean a way that does not involve checking 400 million records by hand, flagging the irregular ones as such, and only then doing something productive.

Is this possible? Can such things be identified without a preset, solid definition for the given data set? Let's take accounting records, for example. I have "x" records and I want to detect any records that are not "natural" for the data set. Is there a way to code a system that does that, given that I don't have prior data in which such records are flagged as not normal?

Emil Filipov
  • This isn't really a question about programming, so it doesn't seem like a good fit for Stack Overflow. General questions about evaluating the fit of different models or detecting outliers are really statistical questions. Those can be asked over at [stats.se]. But as written, this question is too broad to be useful. Different models may have different sensitivities to "outliers", and it can be hard for a human to tell what's a "real" value vs. a "bad" one, so it's even harder for a computer without a clear definition. – MrFlick Oct 14 '16 at 17:57

2 Answers

2

Your question is very broad. Ultimately you are asking for unsupervised learning instead of supervised learning. The answer will depend on how these records are "not natural", or on what "natural" means. If you have no better starting point or model, you might start with cluster analysis. If the vast majority of records are natural in the sense that they lie a small distance from each other, and a few outliers lie far away, cluster analysis will help you find those. The interesting point is how you define "distance" for each problem at hand.

An obvious starting point would be the function hclust in R, and you will find all sorts of high-quality packages in the CRAN Task View on Cluster Analysis: https://cran.r-project.org/web/views/Cluster.html
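To make the idea concrete, here is a minimal sketch in plain Python rather than R. It is not hclust itself: it only mimics what you get when you cut a single-linkage tree at a fixed height, namely groups of records connected by chains of pairwise distances below that height. The function names, the sample records, and the cut_distance value are all made up for illustration, and the pairwise loop is quadratic, so it would only be usable on a sample, not on 400 million records.

```python
import math


def single_linkage_clusters(points, cut_distance):
    """Group point indices so that any two points within cut_distance of each
    other (directly or through a chain of neighbours) share a cluster.
    This matches cutting a single-linkage tree at height cut_distance."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the clusters of every pair of points that are close enough.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= cut_distance:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def flag_small_clusters(points, cut_distance, max_size=1):
    """Return indices of points that end up in clusters of at most max_size."""
    clusters = single_linkage_clusters(points, cut_distance)
    return sorted(i for c in clusters if len(c) <= max_size for i in c)


# Hypothetical records: (amount, some second numeric feature).
records = [(100.0, 1.0), (102.0, 1.2), (98.0, 0.9), (101.0, 1.1), (5000.0, 40.0)]
outliers = flag_small_clusters(records, cut_distance=10.0)
print(outliers)  # [4]
```

The two knobs (the distance function and the cut height) are exactly where the problem-specific judgment goes. With features on very different scales you would probably want to standardize them first, otherwise the largest-valued column dominates the Euclidean distance.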

Bernhard
  • I have been given the task of designing a "system" to detect outliers without having a prior policy for doing so (there are no set criteria), and that is exactly the problem: how do I define the "distance", as you call it, at all, let alone in the best possible way, so that I don't get 50% outliers in a data set? Thank you for the suggestion about cluster analysis. I will be sure to look into that. – Emil Filipov Oct 14 '16 at 14:22
  • When you start reading about cluster analysis you will inevitably come across a range of distance functions (if you don't know what you are looking for, Euclidean distance is a good first step in many situations). If the outliers consist of extremely large or extremely small numbers, hierarchical clustering should find them easily. If the outliers break some rule of repetition, it will be more difficult. "Unsupervised learning" remains your search term if cluster analysis does not get you to your goal. – Bernhard Oct 14 '16 at 14:26
  • Hmmm, yes, perhaps. By the looks of it I will be very busy in the days to come. I will have to do research on cluster analysis before I can ask anything else about the problem. Thank you again! – Emil Filipov Oct 14 '16 at 14:39
1

There is one sentence you will find in all serious statistics books: know your data. Cleaning your data and getting to know it is part of the work (and most of the time the largest part). Therefore there is not really a standard procedure, but here are some hints:

  • Numerical data: make a lot of plots, e.g. boxplots, scatterplots, histograms, etc.
  • Categorical data: make some counts, e.g. use table
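Since you also use Python, here is a rough sketch of what those two hints check, without the plotting. The numeric part applies the usual boxplot whisker rule (values more than 1.5 times the interquartile range beyond the quartiles), and the categorical part is a plain frequency count, similar in spirit to table in R. The function names, the sample data, and the thresholds are only assumptions for illustration; what counts as "too far" or "too rare" depends on your data.

```python
from collections import Counter
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values outside the boxplot fences q1 - k*IQR and q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default "exclusive" method
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def rare_categories(labels, max_count=1):
    """Return categories that occur at most max_count times, with their counts."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n <= max_count}


# Hypothetical accounting data: booking amounts and account labels.
amounts = [120, 135, 128, 131, 127, 9800, 125, 133]
accounts = ["rent", "rent", "payroll", "payroll", "payroll", "rnet"]

print(iqr_outliers(amounts))      # [9800]
print(rare_categories(accounts))  # {'rnet': 1}
```

Neither check proves that a record is wrong. They only point you to the records worth looking at first, which is the "know your data" part.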

You can find a more technical discussion in "How to remove outliers from a dataset", and a tutorial here: https://www.r-bloggers.com/identify-describe-plot-and-remove-the-outliers-from-the-dataset/

HTH, ben

holzben