Questions tagged [outliers]

An outlier is an observation that appears unusual, or is not well described, relative to a simple characterization of a dataset.

Overview

Outliers are not necessarily bad or wrong, nor do they need to be removed from data for further analysis. However, outliers (of which there can be more than one in any set of data) indicate that some data at least appear to differ from the bulk of the dataset, suggesting they should be individually examined and understood. Also, some statistical procedures are sensitive to outliers: this means that removal of one or more outliers could substantially change the conclusions of those procedures.
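As a rough illustration of that sensitivity, the sketch below compares the mean and the median of a small sample with and without one extreme value. It assumes Python with NumPy; the numbers are made up for the demo, and the cutoff of 2000 is an arbitrary choice, not a recommended rule.

```python
import numpy as np

# Hypothetical data: values near 1200 plus one extreme observation.
values = np.array([1200.0, 1210.0, 1190.0, 1205.0, 1195.0, 4000.0])

# Arbitrary cutoff used only to drop the extreme value in this demo.
without_outlier = values[values < 2000]

mean_with = values.mean()
mean_without = without_outlier.mean()
median_with = np.median(values)
median_without = np.median(without_outlier)

# The mean shifts by several hundred units; the median barely moves.
print("mean:  ", mean_with, "->", mean_without)
print("median:", median_with, "->", median_without)
```

The mean is an example of a statistic that is sensitive to a single extreme value, while the median is comparatively robust to it.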

Tag usage

Consider whether the question would be more suitable on Stack Overflow (programming-related) or Cross Validated (statistics-related).

In R, a software environment for statistical computing and graphics, the function boxplot.stats provides a basic method for detecting outliers.
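For readers working in Python, a minimal sketch of the same kind of boxplot rule is shown below. It flags values lying more than 1.5 times the interquartile range beyond the quartiles. The function name tukey_outliers is hypothetical, and the sketch uses NumPy percentiles, whereas R's boxplot.stats computes hinges, so the two can disagree on small samples.

```python
import numpy as np

def tukey_outliers(x, coef=1.5):
    """Return the values of x lying more than coef * IQR beyond the quartiles."""
    x = np.asarray(x, dtype=float)
    # Quartiles via linear interpolation (NumPy default); R uses hinges instead.
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower = q1 - coef * iqr
    upper = q3 + coef * iqr
    return x[(x < lower) | (x > upper)]

flagged = tukey_outliers([1, 2, 3, 4, 5, 10])
print(flagged)
```

Flagged values are candidates for closer inspection, not automatically errors to delete.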

1199 questions
379 votes, 18 answers

Detect and exclude outliers in a pandas DataFrame

I have a pandas data frame with a few columns. Now I know that certain rows are outliers based on a certain column value. For instance, column 'Vol' has all values around 12xx and one value is 4000 (outlier). Now I would like to exclude those rows…
AMM
46 votes, 1 answer

Matplotlib boxplot without outliers

Is there any way of hiding the outliers when plotting a boxplot in matplotlib (python)? I'm using the simplest way of plotting it: from pylab import * boxplot([1,2,3,4,5,10]) show() This gives me the following plot: (I cannot post the image…
Didac Busquets
44 votes, 5 answers

matplotlib: disregard outliers when plotting

I'm plotting some data from various tests. Sometimes in a test I happen to have one outlier (say 0.1), while all other values are three orders of magnitude smaller. With matplotlib, I plot against the range [0, max_data_value] How can I just zoom…
Ricky Robinson
34 votes, 1 answer

How to remove outliers in boxplot in R?

Possible Duplicate: Changing the outlier rule in a boxplot I need to visualize my result using a box plot. x<-rnorm(10000) boxplot(x,horizontal=TRUE,axes=FALSE) How can I filter outliers during visualisation? (1) So that I can have full image…
Manish
31 votes, 5 answers

Remove Outliers in Pandas DataFrame using Percentiles

I have a DataFrame df with 40 columns and many records. df: User_id | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 |...| Col39 For each column except the user_id column I want to check for outliers and remove the whole record, if an outlier…
Mi Funk
28 votes, 3 answers

How to use Isolation Forest

I am trying to detect the outliers in my dataset and I found sklearn's Isolation Forest. I can't understand how to work with it. I fit my training data in it and it gives me back a vector with -1 and 1 values. Can anyone explain to me how it…
dapo
23 votes, 2 answers

Need a data set for fraud detection

I have a fraud detection algorithm, and I want to check to see if it works against a real world data set. My algorithm says that a claim is usual or not. Are there any data sets available?
saeed arash
22 votes, 3 answers

Time series forecasting, dealing with known big orders

I have many data sets with known outliers (big orders) data <-…
18 votes, 5 answers

Remove outliers fully from multiple boxplots made with ggplot2 in R and display the boxplots in expanded format

I have some data here [in a .txt file] which I read into a data frame df, df <- read.table("data.txt", header=T,sep="\t") I remove the negative values in the column x (since I need only positive values) of the df using the following code, yp <-…
Amm
18 votes, 4 answers

How to use Outlier Tests in R Code

As part of my data analysis workflow, I want to test for outliers, and then do my further calculation with and without those outliers. I've found the outlier package, which has various tests, but I'm not sure how best to use them for my workflow.
PaulHurleyuk
15 votes, 8 answers

Algorithm to quickly find animals away from the herd

I am developing a simulation program. There are herds of animals (wildebeests), and in that herd, I need to be able to find one animal that is away from the herd. On the picture below, green dots are away from the herd. It is these points that I'd…
13 votes, 2 answers

Multivariate Outlier Detection using R with probability

I have been searching everywhere for the best method to identify the multivariate outliers using R but I don't think I have found any believable approach yet. We can take the iris data as an example as my data also contains multiple fields…
Duy Bui
13 votes, 2 answers

Finding the outlier points from matplotlib : boxplot

I am plotting a non-normal distribution using boxplot and interested in finding out about outliers using boxplot function of matplotlib. Besides the plot I am interested in finding out the value of points in my code which are shown as outliers in…
Abhi
12 votes, 3 answers

ggplot2 Color Scale Over Affected by Outliers

I'm having difficulty with a few outliers making the color scale useless. My data has a Length variable that is based in a range, but will usually have a few much larger values. The below example data has 95 values between 500 and 1500, and 5 values…
ARobertson
12 votes, 4 answers

Outlier detection in data mining

I have a few sets of questions regarding outlier detection: Can we find outliers using k-means and is this a good approach? Is there any clustering algorithm which does not accept any input from the user? Can we use support vector machine or any…
Navin