
I need to iterate over a lot of datasets and remove the outliers from each. As a threshold, I simply use the standard rule |val - mean| > 3·σ. Currently, I have the following solution:

import numpy as np

def remove_outlier(data):
    data_t = np.array(data).T.tolist()

    for i, ele in enumerate(data_t):
        # compute the statistics once per row instead of once per value
        mean_ele = np.mean(ele)
        std_ele = np.std(ele)
        temp = []
        for val in ele:
            if abs(val - mean_ele) < 3 * std_ele:
                temp.append(val)
        data_t[i] = np.asarray(temp)
    data = np.asarray(data_t).T
    return data

I'm looking for a faster solution, because this takes up to 7 seconds per dataset (not surprising, given the double for-loop).

I've come across scipy's z-score method, and since it supports an axis argument it looks like a better and faster fit than my solution. Is there a shortcut to remove the values whose z-scores exceed the threshold from my dataset? I played around with numpy.where(), but it only gives me the values (or indices) above a threshold, not the filtered dataset.

The shape of the data is usually around 1000x8, but can also be transposed without any problem.
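For reference, here is a vectorized sketch of the z-score idea (the name remove_outlier_zscore is just a placeholder): it computes z-scores per column with scipy.stats.zscore, masks with boolean indexing, and returns a list of 1-D arrays, since columns can end up with different lengths after filtering.

```python
import numpy as np
from scipy import stats

def remove_outlier_zscore(data, threshold=3):
    """Drop values whose |z-score| exceeds `threshold`, per column.

    Returns a list of 1-D arrays (one per column) because columns
    may no longer have equal lengths after removal.
    """
    data = np.asarray(data, dtype=float)      # e.g. shape (1000, 8)
    z = np.abs(stats.zscore(data, axis=0))    # z-score within each column
    keep = z < threshold                      # boolean mask, same shape
    return [col[m] for col, m in zip(data.T, keep.T)]
```

This replaces both loops with two array operations, so it should be orders of magnitude faster than the per-value Python loop for 1000x8 data.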

Lukas S
  • First approach: calculate meanEle and stdEle outside the `for val in ele` loop; this will give you some air... also use abs(val-meanEle) < 3*stdEle in the if – Ulises Bussi Nov 09 '21 at 18:48
  • You are trying to remove values, so you cannot use a numpy array, because you will no longer have the same size for each row. Am I right? – Ulises Bussi Nov 09 '21 at 18:51
  • One of the other technical methods for removing outliers is to removal all v < median - 1.5*IQR, v >median + 1.5*IQR. Where IQR is the interquartile range. See: https://online.stat.psu.edu/stat200/lesson/3/3.2 for IQR and https://stackoverflow.com/questions/11686720/is-there-a-numpy-builtin-to-reject-outliers-from-a-list for an example of a similar question to yours. – Larry the Llama Nov 09 '21 at 19:08
  • @UlisesBussi Yes, I do an FFT per transposed row (shape of 8x1000) in the next step. – Lukas S Nov 10 '21 at 09:40
  • But if you remove the value, the rows will no longer have 1000 values – Ulises Bussi Nov 10 '21 at 14:19
  • Yeah, that's why I transpose into lists. Not every dataset has a length of 1000; 1000 is just an estimate. Normally the data is 700-1400 values per row. In an earlier approach I processed the whole array at once, but you can't FFT when the rows/cols don't have equal sizes, so I have to do it per row (8 in total per dataset). But thank you, doing the calculation outside the inner for-loop did speed up the code. Didn't think it would eat that much time. – Lukas S Nov 10 '21 at 17:50

0 Answers