I am looking for a Python algorithm that finds outliers based on (a) the tendency of the dataset (growing or shrinking) and (b) the numerical distance from the immediate neighbour: a value x should not differ from the previous value by more than 1% in absolute terms.

Edit: An approach based on cubic spline interpolation would also be fine, if a Python implementation is available.

I have seen the NumPy approach that uses the standard deviation, but since I have to take the order of the values in the list into account, it does not work for this case.

0.0937,
0.0934,
0.0933,
0.0931,
0.0933,
0.0936,
0.1091 < == outlier,
0.0938,
0.0945,
0.0949,
0.0956,
0.1082, 
0.1065 < == outlier since -,
0.1123,
0.1198
Anne
  • 4
    Before looking for an algorithm you should define what an "outlier" actually is. – Klaus D. Jul 30 '15 at 16:06
  • I was hoping the time series i have given illustrates that: 0.0937, 0.0934, 0.0933, 0.0931, 0.0933, 0.0936, 0.1091 < == outlier, 0.0938, 0.0945, 0.0949, 0.0956, 0.1082, 0.1065 < == outlier since -, 0.1123, 0.1198. Therefore: an outlier is a number not fitting with the tendency of the list (while the tendency of the list can change over multiple rows) or having >1% of change to the immediate neighbour. – Anne Jul 30 '15 at 16:09
  • 4
    You will not need examples, you will need a mathematical definition, e.g. "A value is considered to be an outlier if…" – Klaus D. Jul 30 '15 at 16:11
  • You may need to use two algorithms, numpy and your own, on the data. – Brent Washburne Jul 30 '15 at 16:12
  • @Klaus: So please tell me what is unclear in: an outlier is a number not fitting with the tendency of the list (while the tendency of the list can change over multiple rows) or having >1% of change to the immediate neighbour. – Anne Jul 30 '15 at 16:16
  • 1
    How are you defining outlier? Outlier as in outside 2 standard deviations? Outside the center 50 percentile? < y or > x ? – Steven Jul 30 '15 at 16:19
  • 1
    "Tendency" is also not clear. Linear regression? Cubic spline? –  Jul 30 '15 at 16:25
  • 2
    To my eyes, you're wrong with respect to the "eyeball" definition of outlier: it's not 0.1065 which is weird because it's too low, it's 0.1082 which looks weird because it's too high. We'd only have to move 0.1082 to get a clean curve, whereas if we assume it's 0.1065 which is too low we'd have to change 0.1123 and 0.1198 as well. This is why it's important to be specific about your criteria. – DSM Jul 30 '15 at 16:47
  • Did any of the answers below help solve your problem? If one did, please accept it to help others who have a similar problem. If not, please let me know so that I can delete mine and save people from wasting their time on it. – innoSPG Jan 26 '16 at 17:51

2 Answers

0

You can compute the backward and forward gradients of your data, assuming a constant step of 1. The outliers are the elements for which the following conditions hold:

  • backward and forward gradients do not have the same sign: change of tendency
  • absolute value of backward gradient greater than 1% of the absolute value of the left neighbor

My interpretation of your statement is that both conditions must hold.

Let f be a 1-D numpy array of your data.

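import numpy as np

f = np.array([
0.0937,
0.0934,
0.0933,
0.0931,
0.0933,
0.0936,
0.1091, # <== outlier according to the question
0.0938,
0.0945,
0.0949,
0.0956,
0.1082,
0.1065, # <== outlier according to the question ("since -")
0.1123,
0.1198
])
bg = np.zeros_like(f) # backward gradient, same size as f
fg = np.zeros_like(f) # forward gradient, same size as f
bg[1:] = np.diff(f)   # f[i] - f[i-1]; bg[0] stays 0
fg[:-1] = np.diff(f)  # f[i+1] - f[i]; fg[-1] stays 0

# Condition 1: the gradient changes sign at this element
sign_change = bg * fg < 0
# Condition 2: the step from the left neighbor exceeds 1% of that neighbor
big_jump = np.hstack((False, np.abs(bg[1:]) > 0.01 * np.abs(f[:-1])))
outliers = sign_change & big_jump
# You don't want to remove an element and the next one as well
outliers[1:] = outliers[1:] & ~outliers[:-1]

# With this data the mask selects 0.1091 and 0.1082; 0.1065 is not
# selected because it directly follows a flagged element.
print('Outliers =', f[outliers])
print('Good =', f[~outliers])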

The example uses your data; just replace f with your own array.
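
If NumPy is not available, or the 1% threshold needs to be adjustable, the same two-condition rule can be sketched in plain Python. This is only an illustration: the function name find_outliers and the parameter rel_tol are made up for this sketch and are not part of any library.

```python
def find_outliers(values, rel_tol=0.01):
    """Return indices flagged by the two-condition rule (hypothetical helper).

    An element is flagged when the step into it and the step out of it have
    opposite signs AND the step into it exceeds rel_tol times the absolute
    value of its left neighbor.
    """
    n = len(values)
    flags = [False] * n
    # The first and last elements have only one neighbor and are never flagged.
    for i in range(1, n - 1):
        backward = values[i] - values[i - 1]
        forward = values[i + 1] - values[i]
        changes_direction = backward * forward < 0
        big_jump = abs(backward) > rel_tol * abs(values[i - 1])
        flags[i] = changes_direction and big_jump
    # Drop a flag that directly follows another flag (the rebound after a spike).
    return [i for i in range(n) if flags[i] and not (i > 0 and flags[i - 1])]


data = [
    0.0937, 0.0934, 0.0933, 0.0931, 0.0933, 0.0936, 0.1091, 0.0938,
    0.0945, 0.0949, 0.0956, 0.1082, 0.1065, 0.1123, 0.1198,
]
outlier_indices = find_outliers(data)
print([data[i] for i in outlier_indices])
```

On the question's data this flags 0.1091 and 0.1082, not 0.1065, which matches the point made in the comments that the result depends on how "outlier" is defined.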

innoSPG
0

If you want Python algorithms for monotonically increasing data, see this question:

Python - How to check list monotonicity

In particular, this answer uses numpy:

https://stackoverflow.com/a/4983495/584846

You can combine this with the NumPy approach based on the standard deviation.
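
As a rough idea of what a monotonicity check looks like, here is a minimal plain-Python sketch. It is not the code from the linked answer; the helper names monotonic_breaks and is_monotonic are invented for this example, and it assumes you already know whether the stretch of data is supposed to rise or fall.

```python
def monotonic_breaks(values, increasing=True):
    """Return indices i where values[i] moves against the expected direction."""
    breaks = []
    for i in range(1, len(values)):
        step = values[i] - values[i - 1]
        if (increasing and step < 0) or (not increasing and step > 0):
            breaks.append(i)
    return breaks


def is_monotonic(values, increasing=True):
    """True if no step moves against the expected direction."""
    return not monotonic_breaks(values, increasing)


# Rising tail of the question's data, after the first spike.
rising = [0.0938, 0.0945, 0.0949, 0.0956, 0.1082, 0.1065, 0.1123, 0.1198]
print(is_monotonic(rising))
print(monotonic_breaks(rising))
```

Here the only break is at index 5 (the value 0.1065), so a pure monotonicity check points at the dip rather than at the jump before it. A size-based test such as the standard deviation is what would catch the jump.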

Brent Washburne