apply generic function in a vectorized fashion using numpy/pandas

Question

I am trying to vectorize my code and, thanks in large part to some users (https://stackoverflow.com/users/3293881/divakar, https://stackoverflow.com/users/625914/behzad-nouri), I have made huge progress. Essentially, I am trying to apply a generic function (in this case max_dd_array_ret) to each of the bins I found (see "vectorize complex slicing with pandas dataframe" for details on the date vectorization and "Start, End and Duration of Maximum Drawdown in Python" for the rationale behind max_dd_array_ret).

The problem is the following: I should be able to obtain the result df_2 and, to some degree, ranged_DD(asd_1.values, starts, ends+1) gives what I am looking for. However, its output has only 10 values instead of 12: it looks as if the first two bins are merged and the last one is missing, as can be seen in the results below.

Any explanation and fix would be very welcome.

import pandas as pd
import numpy as np
from time import time
from scipy.stats import binned_statistic

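def max_dd_array_ret(xs):
    # Turn simple returns into a cumulative wealth index
    xs = (xs+1).cumprod()
    i = np.argmax(np.maximum.accumulate(xs) - xs) # end of the period
    if i == 0:
        # No drawdown: xs[:i] is empty and np.argmax would raise
        return 0
    j = np.argmax(xs[:i]) # start of the period (peak before the trough)
    max_dd = abs(xs[j]/xs[i] -1)
    return max_dd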

def get_ranges_arr(starts,ends):
    # Taken from https://stackoverflow.com/a/37626057/3293881
    counts = ends - starts
    counts_csum = counts.cumsum()
    id_arr = np.ones(counts_csum[-1],dtype=int)
    id_arr[0] = starts[0]
    id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
    return id_arr.cumsum()

def ranged_DD(arr,starts,ends):
    # Get all indices and the IDs corresponding to same groups
    idx = get_ranges_arr(starts,ends)
    id_arr = np.repeat(np.arange(starts.size),ends-starts)

    slice_arr = arr[idx]
    return binned_statistic(id_arr, slice_arr, statistic=max_dd_array_ret)[0]

asd_1 = pd.Series(0.01 * np.random.randn(500), index=pd.date_range('2011-1-1', periods=500)).pct_change()

index_1 = pd.to_datetime(['2011-2-2', '2011-4-3', '2011-5-1','2011-7-2', '2011-8-3', '2011-9-1','2011-10-2', '2011-11-3', '2011-12-1','2012-1-2', '2012-2-3', '2012-3-1',])
index_2 = pd.to_datetime(['2011-2-15', '2011-4-16', '2011-5-17','2011-7-17', '2011-8-17', '2011-9-17','2011-10-17', '2011-11-17', '2011-12-17','2012-1-17', '2012-2-17', '2012-3-17',])

starts = asd_1.index.searchsorted(index_1)
ends = asd_1.index.searchsorted(index_2)

df_2 = pd.DataFrame([max_dd_array_ret(asd_1.loc[i:j]) for i, j in zip(index_1, index_2)], index=index_1)

print(df_2[0].values)
print(ranged_DD(asd_1.values, starts, ends+1))

results:

df_2
[ 1.75893509  6.08002911  2.60131797  1.55631781  1.8770067   2.50709085
  1.43863472  1.85322338  1.84767224  1.32605754  1.48688414  5.44786663]
ranged_DD(asd_1.values, starts, ends+1)
[ 6.08002911  2.60131797  1.55631781  1.8770067   2.50709085  1.43863472
  1.85322338  1.84767224  1.32605754  1.48688414]

The two outputs are identical except at the ends: the first starts with [ 1.75893509 6.08002911 where the second starts with [ 6.08002911, and the first ends with 1.48688414 5.44786663] where the second ends with 1.48688414].

P.S.: While looking in more detail at the docs (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binned_statistic.html), I found that this might be the problem:

"All but the last (righthand-most) bin is half-open. In other words, if bins is [1, 2, 3, 4], then the first bin is [1, 2) (including 1, but excluding 2) and the second [2, 3). The last bin, however, is [3, 4], which includes 4. New in version 0.11.0."

The problem is that I don't know how to reset it.
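
One likely explanation, sketched here on made-up data rather than offered as a confirmed diagnosis: binned_statistic is called without a bins argument, so it uses its default of bins=10 equal-width bins spanning the range of the IDs. With 12 integer IDs (0 to 11) the edges fall at 0, 1.1, 2.2, ..., so IDs 0 and 1 share the first bin and IDs 10 and 11 share the last, which would give 10 results instead of 12. Passing explicit integer edges gives one bin per ID. The toy arrays and the use of np.max as the statistic are assumptions for illustration; they are not taken from the code above.

```python
import numpy as np
from scipy.stats import binned_statistic

# Made-up data: 12 group IDs (0..11) with three values per group.
n_groups = 12
ids = np.repeat(np.arange(n_groups), 3)
values = np.arange(ids.size, dtype=float)

# Default call: bins=10 equal-width bins over [0, 11].
# IDs 0 and 1 land in the first bin, IDs 10 and 11 in the last one.
default_stat, default_edges, _ = binned_statistic(ids, values, statistic=np.max)

# Explicit edges 0, 1, ..., 12: one bin per integer ID.
# The last bin [11, 12] is closed on the right, which is harmless here.
edges = np.arange(n_groups + 1)
per_group_stat, _, _ = binned_statistic(ids, values, statistic=np.max, bins=edges)

print(default_stat.size)    # number of results with the default bins
print(per_group_stat.size)  # number of results with one bin per ID
```

If this is indeed the cause, the corresponding change in ranged_DD would be to pass bins=np.arange(starts.size + 1) to binned_statistic.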

Honestly, you will be hard pressed to find any improvements by using `numpy.vectorize` over a simple `pandas.DataFrame.apply` or `pandas.Series.apply`. I have considerable experience with the two, and I can say confidently that most of the time, `numpy.vectorize` will ruin performance because it does not recognize `pandas` as a datatype. — Kartik, Aug 17 '16 at 02:56
That said, use `.values` when passing arguments so that you pass instances of `numpy.ndarray` instead of `pandas.Series`. So this statement: `max_dd_array_ret(asd_1.loc[i:j])` becomes `max_dd_array_ret(asd_1.loc[i:j].values)` — Kartik, Aug 17 '16 at 02:58
Good to know. Still, my main concern is that the two results are the same except for the first two and last two elements, which get clumped together for no apparent reason — Asher11, Aug 17 '16 at 06:47
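
To check a vectorized result against a plain baseline, along the lines of the suggestion above to pass numpy.ndarray instances, a simple loop over positional slices can be used. This is a minimal sketch: apply_per_range is a hypothetical helper, and the toy array and np.max stand in for asd_1.values and max_dd_array_ret.

```python
import numpy as np

def apply_per_range(func, arr, starts, ends):
    """Apply func to arr[s:e] for each half-open range [s, e) (hypothetical helper)."""
    return np.array([func(arr[s:e]) for s, e in zip(starts, ends)])

# Made-up data standing in for asd_1.values and the searchsorted positions.
arr = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])
starts = np.array([0, 3, 6])
ends = np.array([2, 5, 7])  # inclusive end positions, hence ends + 1 below

# One result per (start, end) pair, in the same order as starts.
result = apply_per_range(np.max, arr, starts, ends + 1)
print(result)
```

Because the loop returns exactly one value per range, its length and order give a reference to compare against the binned output.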
