I have a dataset of several million rows of trading data that I am trying to apply a filter to, as in the following code. The function trade_quant_filter looks for outliers and adds the index of each outlier to a list, to be dealt with later.
import numpy as np

def trim_moments(arr, alpha):
    """Trimmed mean and std: discard roughly an alpha fraction of the most extreme values."""
    arr = np.sort(arr)  # np.sort returns a sorted copy; it does not sort in place
    n = len(arr)
    k = int(round(n * float(alpha)) / 2)
    return np.mean(arr[k + 1:n - k]), np.std(arr[k + 1:n - k])

def trade_quant_filter(dataset, alpha, window, gamma):
    """Flag rows whose price is more than 3 trimmed local stds (plus gamma) from the trimmed local mean."""
    radius = int(round(window / 2))
    bad_values = []
    for count, row in dataset.iterrows():
        if count < radius:  # starting case, when we can't take a symmetric window
            local_mean, local_std = trim_moments(
                dataset['price'][:count + window].values, alpha)
            if np.abs(dataset['price'][count] - local_mean) > 3 * local_std + gamma:
                bad_values.append(count)
        elif count > (dataset.shape[0] - radius):  # ending case, window truncated on the right
            local_mean, local_std = trim_moments(
                dataset['price'][count - window:count].values, alpha)
            if np.abs(dataset['price'][count] - local_mean) > 3 * local_std + gamma:
                bad_values.append(count)
        else:  # interior case, symmetric window of radius rows on each side
            local_mean, local_std = trim_moments(
                dataset['price'][count - radius:count + radius].values, alpha)
            if np.abs(dataset['price'][count] - local_mean) > 3 * local_std + gamma:
                bad_values.append(count)
    return bad_values
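For reference, a minimal, self-contained way to exercise the function on synthetic data (the column name 'price' comes from the code above; the alpha, window, and gamma values here are just placeholders, not the ones I actually use):

    import numpy as np
    import pandas as pd

    np.random.seed(0)
    prices = 100 + np.cumsum(np.random.randn(1000))  # random-walk prices
    prices[::200] += 25                              # inject a few obvious outliers
    trades = pd.DataFrame({'price': prices})

    bad = trade_quant_filter(trades, alpha=0.05, window=20, gamma=0.01)
    print(len(bad), "rows flagged")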
The problem is that, as written, this code is far too slow to handle several million entries; 150,000 rows already take about 30 seconds:
stats4 = %prun -r trade_quant_filter(trades_reduced[:150000], alpha,window,gamma)
Ordered by: internal time
List reduced from 154 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
600002 3.030 0.000 3.030 0.000 {method 'reduce' of 'numpy.ufunc' objects}
150000 2.768 0.000 4.663 0.000 _methods.py:73(_var)
1 2.687 2.687 40.204 40.204 <ipython-input-102-4f16164d899e>:8(trade_quant_filter)
300000 1.738 0.000 1.738 0.000 {pandas.lib.infer_dtype}
6000025 1.548 0.000 1.762 0.000 {isinstance}
300004 1.481 0.000 6.937 0.000 internals.py:1804(make_block)
300001 1.426 0.000 13.157 0.000 series.py:134(__init__)
300000 1.033 0.000 3.553 0.000 common.py:1862(_possibly_infer_to_datetimelike)
150000 0.945 0.000 2.562 0.000 _methods.py:49(_mean)
300000 0.902 0.000 12.220 0.000 series.py:482(__getitem__)
There are a couple of things that make optimizing this function challenging:
- As far as I can tell, there is no way to avoid iterating row by row here and still take the trimmed rolling means and standard deviations. I plan on looking into how functions like rolling_mean are implemented in Pandas next (a rough sketch of the kind of thing I mean is below this list).
- The lack of hashability of dictionaries also makes it impossible to calculate the trimmed rolling means and standard deviations, so I can't just convert the dataframe to a dict.
Using Cython and ndarrays as recommended here seems like a possibility, and I'm learning about Cython now.
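Before going all the way to Cython, one intermediate step I'm considering is pulling the price column out as a plain ndarray once and slicing that inside the loop, since the profile above seems dominated by building pandas Series in __getitem__. A rough, unbenchmarked sketch of the same loop (it reuses trim_moments from above and assumes the index is the default 0..n-1 range, as it is in my case):

    def trade_quant_filter_arr(dataset, alpha, window, gamma):
        # Same logic as the original, but slice one ndarray instead of the
        # DataFrame, so no pandas Series objects are built inside the loop.
        prices = dataset['price'].values
        radius = int(round(window / 2))
        n = len(prices)
        bad_values = []
        for count in range(n):
            if count < radius:
                lo, hi = 0, count + window
            elif count > n - radius:
                lo, hi = count - window, count
            else:
                lo, hi = count - radius, count + radius
            local_mean, local_std = trim_moments(prices[lo:hi], alpha)
            if np.abs(prices[count] - local_mean) > 3 * local_std + gamma:
                bad_values.append(count)
        return bad_values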
What is the most straightforward way of optimizing this code? I'm looking for at least a 10x speed improvement.