
I have a pandas dataframe with a series of continuously varying speed values. It is sensor data, so we often get erroneous readings at some points, and a moving average does not seem to help either. What methods can I use to remove these outliers or peak points from the data?

Example:

data_points = [0.5, 0.5, 0.7, 0.6, 0.5, 0.7, 0.5, 0.4, 0.6, 4, 0.5, 0.5, 4, 5, 6, 0.4, 0.7, 0.8, 0.9]

In this data, the points 4, 4, 5, and 6 are clearly outliers. I previously used a rolling mean with a 5-minute window to smooth the values, but I am still getting a lot of blip points like these, which I want to remove. Can anyone suggest a technique to get rid of them?

Here is an image that gives a clearer view of the data: [plot of the speed data with outlier spikes]

As you can see here, the data shows some outlier points that I have to remove. Any idea what would be a possible way to get rid of these points?

  • You could calculate the z-score of all points and reject above some threshold. – ALollz Jun 24 '18 at 01:04
  • @ALollz That works when your normal distribution lies on both sides, but here I won't have any value below zero; speed will never go negative. Is the z-score still the right technique to use in this case? – id101112 Jun 24 '18 at 01:50
  • Oh good point, that data will not be normal. Do you have a sense of what the underlying distribution should be empirically? – ALollz Jun 24 '18 at 03:43
  • Here’s a link that may be of use: [outlier detection on skewed distributions](https://stats.stackexchange.com/questions/129274/outlier-detection-on-skewed-distributions) – ALollz Jun 24 '18 at 03:52

2 Answers


I really think the z-score using scipy.stats.zscore() is the way to go here. Have a look at the related discussion in this post, where the focus is on which method to use before removing potential outliers. As I see it, your challenge is a bit simpler: judging by the data provided, it is pretty straightforward to identify potential outliers without having to transform the data. Below is a code snippet that does just that. Just remember, though, that what does and does not look like an outlier depends entirely on your dataset, and after removing some outliers, points that did not look like outliers before suddenly may. Have a look:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats

# your data (as a list)
data = [0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9]

# initial plot
df1 = pd.DataFrame(data = data)
df1.columns = ['data']
df1.plot(style = 'o')

# Function to identify and remove outliers
def outliers(df, level):

    # 1. temporary dataframe (work on a copy of the dataframe passed in,
    #    not on the global df1)
    df = df.copy(deep = True)

    # 2. Select a level for a Z-score to identify and remove outliers
    df_Z = df[(np.abs(stats.zscore(df)) < level).all(axis=1)]
    ix_keep = df_Z.index

    # 3. Subset the raw dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]

    return df_keep

Original data:

[plot: original data]

Test run 1 : Z-score = 4:

[plot: data after filtering with z-score threshold 4, unchanged]

As you can see, no data has been removed because the level was set too high.

Test run 2 : Z-score = 2:

[plot: data after filtering with z-score threshold 2]

Now we're getting somewhere. Two outliers have been removed, but there is still some dubious data left.

Test run 3 : Z-score = 1.2:

[plot: data after filtering with z-score threshold 1.2]

This is looking really good. The remaining data now seems a bit more evenly distributed than before. But now the highest remaining data point is starting to look like a potential outlier itself. So where to stop? That's going to be entirely up to you!
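As the comments point out, this data is skewed (speed never goes negative), so the ordinary z-score's mean and standard deviation are themselves pulled up by the blips. A more robust variant is the modified z-score based on the median and the median absolute deviation (MAD). This is a sketch of that alternative, not part of the original answer; the 0.6745 constant and 3.5 cutoff are the conventional defaults:

```python
import numpy as np

def modified_zscore_outliers(values, threshold=3.5):
    """Flag outliers using the modified z-score (median/MAD),
    which is robust to the blips it is trying to detect."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # all points identical up to the median; nothing to flag
        return np.zeros(len(values), dtype=bool)
    mz = 0.6745 * (values - median) / mad
    return np.abs(mz) > threshold

data = [0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9]
mask = modified_zscore_outliers(data)
clean = [v for v, bad in zip(data, mask) if not bad]
# on this sample, exactly the four blips (4, 4, 5, 6) are flagged
```

Because the median and MAD barely move when a few extreme points are added, this flags the blips without the threshold tuning the plain z-score needed above.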

EDIT: Here's the whole thing for an easy copy&paste:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats

# your data (as a list)
data = [0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9]

# initial plot
df1 = pd.DataFrame(data = data)
df1.columns = ['data']
df1.plot(style = 'o')

# Function to identify and remove outliers
def outliers(df, level):

    # 1. temporary dataframe (work on a copy of the dataframe passed in,
    #    not on the global df1)
    df = df.copy(deep = True)

    # 2. Select a level for a Z-score to identify and remove outliers
    df_Z = df[(np.abs(stats.zscore(df)) < level).all(axis=1)]
    ix_keep = df_Z.index

    # 3. Subset the raw dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]

    return df_keep

# remove outliers
level = 1.2
print("df_clean = outliers(df = df1, level = " + str(level)+')')
df_clean = outliers(df = df1, level = level)

# final plot
df_clean.plot(style = 'o')
vestland
  • @id101112 Did this solve your problem? Let me know if not and I'll have another look at it. – vestland Jun 27 '18 at 06:30
  • Sorry for the late reply. I did work on the z-score approach, though in a somewhat different way. Thank you so much for your kind reply. – id101112 Nov 03 '18 at 01:17

You might cut values above a certain quantile as follows:

import numpy as np

# convert to an array first so the comparison is element-wise
# (a plain Python list does not support `<=` against a scalar)
data_points = np.array(data_points)
clean_data = data_points[data_points <= np.percentile(data_points, 95)]

In pandas you would use df.quantile; see the pandas documentation.

Or you may use the Q3 + 1.5*IQR rule to eliminate the outliers, as a boxplot does.
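For illustration, this is a minimal sketch of that boxplot-style rule in pandas (my own example, not from the answer above); only the upper fence matters here, since speed cannot be negative:

```python
import pandas as pd

data = [0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9]
s = pd.Series(data)

# boxplot fences: Q3 + 1.5 * IQR marks the upper cutoff
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr

clean = s[s <= upper]
# on this sample the fence falls between 0.9 and 4,
# so the four blips (4, 4, 5, 6) are dropped
```

Unlike a fixed percentile cut, the fence adapts to the spread of the bulk of the data, so it does not remove a fixed 5% of points regardless of whether they are outliers.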

Dav2357
  • I used both of these techniques before and they did not seem to work with my data; that's why I am still trying to figure out what other technique would take out only those highest points. I used the z-score and also the IQR method to try to get rid of these points. – id101112 Jun 24 '18 at 16:15