Remove all outliers from a list and not just one in Python

Question

I am trying to remove outliers from a Python list, but my code removes only the first one (190000) and not the second (200000). What is the problem?

import statistics
dataset = [25000, 30000, 52000, 28000, 150000, 190000, 200000]

def detect_outlier(data_1):
    threshold = 1
    mean_1 = statistics.mean(data_1)
    std_1 = statistics.stdev(data_1)
    #print(std_1)
    for y in data_1:
        z_score = (y - mean_1)/std_1
        print(z_score)
        if abs(z_score) > threshold:
            dataset.remove(y)
    return dataset  
dataset = detect_outlier(dataset)
print(dataset)

Output:

[25000, 30000, 52000, 28000, 150000, 200000]

Shouldn't you update `mean_1` and `std_1` after every removal? — goodvibration, Sep 09 '20 at 11:48
Make a copy: `for y in data_1.copy(): ...` or even better, make a new list and append the items that are not outliers. — Chris, Sep 09 '20 at 11:48
Does this answer your question? [Python: Removing list element while iterating over list](https://stackoverflow.com/questions/6022764/python-removing-list-element-while-iterating-over-list) — DarrylG, Sep 09 '20 at 11:51
dataset is not defined in the scope of your detect_outlier function. Perhaps you meant data_1 instead? — Josh Purtell, Sep 09 '20 at 11:56

score 2 · Answer 1 · answered Sep 09 '20 at 11:51

This happens because you are modifying the same list object that you are iterating over. Inside the function, data_1 and dataset refer to the same list, so when you remove an element, every later element shifts one position to the left and the for loop skips the element that moved into the removed slot. You should not remove elements from a list while iterating over it.

The shortest fix is to call the function with a copy, so that the loop iterates over a new list with the same elements rather than over dataset itself:

dataset = detect_outlier(dataset[:])
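
To make the skipping visible, here is a minimal sketch with a small made-up list (the values and the names `values`, `visited`, and `values_fixed` are only for illustration, they are not from the question). It removes items inside the loop and records which elements the loop actually visited, then repeats the removal while iterating over a copy:

```python
# Illustrative values only; not the dataset from the question.
values = [1, 2, 3, 4]

visited = []
for v in values:
    visited.append(v)
    if v >= 2:
        # Removing shifts the later elements left, so the loop's
        # internal index now points one element too far.
        values.remove(v)

print(visited)  # [1, 2, 4] -> 3 was never visited
print(values)   # [1, 3]    -> 3 survived although 3 >= 2

# Iterating over a copy (values_fixed[:]) avoids the problem,
# because the list being looped over is never modified.
values_fixed = [1, 2, 3, 4]
for v in values_fixed[:]:
    if v >= 2:
        values_fixed.remove(v)

print(values_fixed)  # [1]
```

The same thing happens in the question: once 190000 is removed, 200000 moves into its position and the loop ends without ever testing it.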

score 1 · Answer 2 · answered Sep 09 '20 at 12:09

import statistics

def detect_outlier(data_1):
    threshold = 1
    mean_1 = statistics.mean(data_1)
    std_1 = statistics.stdev(data_1)
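    # Build a new list of the values within the threshold instead of
    # removing items from the list that is being iterated over.
    result_dataset = [y for y in data_1 if abs((y - mean_1) / std_1) <= threshold]
    return result_dataset

if __name__ == "__main__":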
    dataset = [25000, 30000, 52000, 28000, 150000, 190000, 200000]
    result_dataset = detect_outlier(dataset)
    print(result_dataset)
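
As a quick check of which values should be flagged, the following sketch prints the z-score of every element of the question's dataset with the same threshold of 1. The names `z_scores`, `flagged`, and `kept` are introduced here for illustration only:

```python
import statistics

dataset = [25000, 30000, 52000, 28000, 150000, 190000, 200000]
threshold = 1

mean = statistics.mean(dataset)
std = statistics.stdev(dataset)  # sample standard deviation

# Pair every value with its z-score; nothing is removed from dataset.
z_scores = [(y, (y - mean) / std) for y in dataset]
for value, z in z_scores:
    print(f"{value:>7}  z = {z:+.2f}")

flagged = [y for y, z in z_scores if abs(z) > threshold]
kept = [y for y, z in z_scores if abs(z) <= threshold]
print(flagged)  # [190000, 200000]
print(kept)     # [25000, 30000, 52000, 28000, 150000]
```

Both 190000 and 200000 have a z-score above 1, so both should be dropped, which is what the list comprehension version returns.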

ABDULVAHAB Kharadi

list comprehensions are definitely the right choice here – Josh Purtell Sep 09 '20 at 12:24