
I need to iterate over a lot of datasets and remove the outliers from each. As a threshold, I simply use the standard rule |val - mean| > 3·σ. Currently, I have the following solution:

import numpy as np

def remove_outlier(data):
    data_t = np.array(data).T.tolist()

    for i, ele in enumerate(data_t):
        # compute the statistics once per row instead of once per value
        mean_ele = np.mean(ele)
        std_ele = np.std(ele)
        temp = []
        for val in ele:
            if abs(val - mean_ele) < 3 * std_ele:
                temp.append(val)
        data_t[i] = np.asarray(temp)
    data = np.asarray(data_t).T
    return data

I'm looking for a faster solution, because this takes up to 7 seconds per dataset (not surprising, given the double for-loop).

I've come across scipy's z-score method, and since it supports an axis argument it looks like a better and faster fit than my solution. Is there a shortcut to remove the values whose z-scores exceed the threshold from my dataset? I played around with numpy.where(), but it only gives me the values (or indices) above a threshold, not the filtered dataset.

The shape of the data is usually around 1000x8, but can also be transposed without any problem.
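For reference, here is a vectorized sketch of the z-score idea (the name remove_outlier_zscore is just a placeholder): it computes z-scores per column with scipy.stats.zscore, masks with boolean indexing, and returns a list of 1-D arrays, since columns can end up with different lengths after filtering.

```python
import numpy as np
from scipy import stats

def remove_outlier_zscore(data, threshold=3):
    """Drop values whose |z-score| exceeds `threshold`, per column.

    Returns a list of 1-D arrays (one per column) because columns
    may no longer have equal lengths after removal.
    """
    data = np.asarray(data, dtype=float)      # e.g. shape (1000, 8)
    z = np.abs(stats.zscore(data, axis=0))    # z-score within each column
    keep = z < threshold                      # boolean mask, same shape
    return [col[m] for col, m in zip(data.T, keep.T)]
```

This replaces both loops with two array operations, so it should be orders of magnitude faster than the per-value Python loop for 1000x8 data.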

Lukas S
  • First approach: calculate meanEle and stdEle outside the `for val in ele` loop; this will give you some air... also use abs(val-meanEle) < 3*stdEle in the if – Ulises Bussi Nov 09 '21 at 18:48
  • You are trying to remove values, so you cannot use a numpy array, because you will no longer have the same size for each row. Am I right? – Ulises Bussi Nov 09 '21 at 18:51
  • One of the other technical methods for removing outliers is to removal all v < median - 1.5*IQR, v >median + 1.5*IQR. Where IQR is the interquartile range. See: https://online.stat.psu.edu/stat200/lesson/3/3.2 for IQR and https://stackoverflow.com/questions/11686720/is-there-a-numpy-builtin-to-reject-outliers-from-a-list for an example of a similar question to yours. – Larry the Llama Nov 09 '21 at 19:08
  • @UlisesBussi Yes, I do an FFT per transposed row (shape of 8x1000) in the next step. – Lukas S Nov 10 '21 at 09:40
  • But if you remove the value, the rows will no longer have 1000 values – Ulises Bussi Nov 10 '21 at 14:19
  • Yeah, that's why I transpose into lists. Not every dataset has a length of 1000; 1000 is just an estimate. Normally the data is 700-1400 values per row. In an earlier approach I processed the whole array at once, but you can't FFT when the rows/cols don't have equal sizes, so I have to do it per row (8 in total per dataset). But thank you, doing the calculation outside the inner for-loop did speed up the code. Didn't think it would eat that much time. – Lukas S Nov 10 '21 at 17:50

0 Answers