I need to iterate over a lot of itemsets and remove the outliers from each one. As a threshold, I simply use the standard rule |val - mean| > 3σ. Currently, I have the following solution:
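    import numpy as np

    def remove_outlier(data):
        # Transpose so that each element of data_t is one column of the input.
        data_t = np.array(data).T.tolist()
        for i, ele in enumerate(data_t):
            temp = []
            for val in ele:
                # Keep only values strictly within mean +/- 3 standard deviations.
                if (val < (3 * np.std(ele)) + np.mean(ele)) and (val > (np.mean(ele) - 3 * np.std(ele))):
                    temp.append(val)
            data_t[i] = np.asarray(temp)
        data = np.asarray(data_t).T
        return data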
I'm looking for a faster solution, because it takes up to 7 seconds per dataset (foreseeable for a double for-loop).
I've come across scipy's zscore function (scipy.stats.zscore), and since it also supports the axis=1 argument, it seems more suitable and faster than my solution. Is there a shortcut for removing the values with the corresponding z-scores from my dataset?
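To make the idea concrete, here is a rough sketch of the kind of z-score filter I have in mind. It assumes that dropping a whole row is acceptable whenever any of its values is an outlier, and that each column is one itemset (hence axis=0 for data of shape 1000x8). The helper name drop_outlier_rows and the threshold parameter are made up for illustration.

```python
import numpy as np
from scipy import stats


def drop_outlier_rows(data, threshold=3.0):
    """Hypothetical helper: drop every row that contains an outlier.

    A value counts as an outlier when its column-wise |z-score| is at
    least `threshold` (population standard deviation, ddof=0).
    """
    data = np.asarray(data, dtype=float)
    # Absolute z-score of every value relative to its own column.
    z = np.abs(stats.zscore(data, axis=0))
    # Keep a row only if all of its values are below the threshold.
    keep = (z < threshold).all(axis=1)
    return data[keep]
```

Note that this removes entire rows, so the result stays rectangular, whereas my loop removes individual values per column. A constant column would produce NaN z-scores and would need separate handling.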
I played around with numpy.where(), but it only returns the entries that satisfy a threshold comparison, not the dataset with those entries removed.
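For reference, the closest I can get with numpy.where() is its three-argument form, which keeps the array shape and replaces the outliers instead of removing them. The sketch below is only an illustration under that assumption (outliers become NaN, columns are the itemsets); mask_outliers and threshold are names I made up.

```python
import numpy as np


def mask_outliers(data, threshold=3.0):
    """Hypothetical helper: replace per-column outliers with NaN."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)  # one mean per column
    std = data.std(axis=0)    # one population std per column
    # True wherever a value lies more than `threshold` stds from its column mean.
    is_outlier = np.abs(data - mean) > threshold * std
    # Three-argument where: NaN at outlier positions, original value elsewhere.
    return np.where(is_outlier, np.nan, data)
```

The shape is preserved this way, so later steps would have to be NaN-aware (for example np.nanmean).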
The shape of the data is usually around 1000x8, but it can also be transposed without any problem.