vectorize complex slicing with pandas dataframe

Question

I'd like to be able to vectorize, for speed purposes, this piece of code. the purpose is to calculate a function, in this case a standard deviation, from a tuple of pair of dates that are cointained in two separate arrays.

import pandas as pd
import numpy as np

asd_1 = pd.Series(0.01 * np.random.randn(252), index=pd.date_range('2011-1-1', periods=252))

index_1 = pd.to_datetime(['2011-2-2', '2011-4-3', '2011-5-1',])
index_2 = pd.to_datetime(['2011-2-15', '2011-4-16', '2011-5-17',])

index_tot = list(zip(index_1,index_2))

aux_learning_std = pd.DataFrame([np.nanstd(asd_1.loc[i:j]) for i, j in index_tot], index=index_1)

the solution, that works, is performed through a loop but i'd rather be able to vectorize it through numpy/pandas, which is much faster. initially I though about using something like:

df_aux = pd.concat([asd_1 for _ in range(len(index_1))], axis=1)
results = df_aux.apply(lambda x: np.nanstd(x.loc[i,j]), axis = 0)

but here I fail to put together the vectors into one operation.

any and all advice is welcome.

p.s.: below there is an image for explanatory purposes

score 4 · Accepted Answer · edited Jun 20 '20 at 09:12

Vectorized standard deviation across ranges in an array

def get_ranges_arr(starts,ends):
    # Taken from http://stackoverflow.com/a/37626057/3293881
    counts = ends - starts
    counts_csum = counts.cumsum()
    id_arr = np.ones(counts_csum[-1],dtype=int)
    id_arr[0] = starts[0]
    id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
    return id_arr.cumsum()

def ranged_std(arr,starts,ends):
    # Get all indices and the IDs corresponding to same groups
    idx = get_ranges_arr(starts,ends)
    id_arr = np.repeat(np.arange(starts.size),ends-starts)
    
    # Extract relevant data
    slice_arr = arr[idx]
    
    # Simulate standard deviation implementation for a number of groups
    # using id_arr as the basis to perform various mathematical operations
    # within each group. Since, std. deviation performs sum/mean reduction,
    # we can simply use np.bincount for an efficient implementation.
    # Std. deviation formula used :
    #https://github.com/numpy/numpy/blob/v1.11.0/numpy/core/fromnumeric.py#L2939
    grp_counts = np.bincount(id_arr)
    mean_vals = np.bincount(id_arr,slice_arr)/grp_counts
    abs_vals = np.abs(slice_arr - mean_vals[id_arr])**2
    return np.sqrt(np.bincount(id_arr,abs_vals)/grp_counts)

Sample run (verify against a loopy version)

In [173]: arr = np.random.randint(0,9,(20))

In [174]: starts = np.array([2,6,11])

In [175]: ends = np.array([8,9,15])

In [176]: [np.std(arr[i:j]) for i,j in zip(starts,ends)]
Out[176]: [1.9720265943665387, 0.81649658092772603, 0.82915619758884995]

In [177]: ranged_std(arr,starts,ends)
Out[177]: array([ 1.97202659,  0.81649658,  0.8291562 ])

Runtime test

Case #1 : Very small number of ranges 3

In [21]: arr = np.random.randint(0,9,(20))

In [22]: starts = np.array([2,6,11])

In [23]: ends = np.array([8,9,15])

In [24]: %timeit [np.std(arr[i:j]) for i,j in zip(starts,ends)]
10000 loops, best of 3: 146 µs per loop

In [25]: %timeit ranged_std(arr,starts,ends)
10000 loops, best of 3: 45 µs per loop

Case #2 : Decent number of ranges 1000

In [32]: arr = np.random.randint(0,9,(1010))

In [33]: starts = np.random.randint(0,9,(1000))

In [34]: ends = starts + np.random.randint(0,9,(1000))

In [35]: %timeit [np.std(arr[i:j]) for i,j in zip(starts,ends)]
10 loops, best of 3: 47.5 ms per loop

In [36]: %timeit ranged_std(arr,starts,ends)
1000 loops, best of 3: 217 µs per loop

Case #3 : Large number of ranges 10000

In [60]: arr = np.random.randint(0,9,(1010))

In [61]: arr = np.random.randint(0,9,(10010))

In [62]: starts = np.random.randint(0,9,(10000))

In [63]: ends = starts + np.random.randint(0,9,(10000))

In [64]: %timeit [np.std(arr[i:j]) for i,j in zip(starts,ends)]
1 loops, best of 3: 474 ms per loop

In [65]: %timeit ranged_std(arr,starts,ends)
100 loops, best of 3: 2.17 ms per loop

Really amazing speedups of 200x+!

Using ranged_std to solve our case

# Get start, stop numeric indices as needed for getting ranges array later on
starts = asd_1.index.searchsorted(index_1)
ends = asd_1.index.searchsorted(index_2)

# Create final dataframe output using ranged_std func
df = pd.DataFrame(ranged_std(asd_1.values,starts,ends+1),index=index_1)

Sample run for verification -

In [17]: asd_1 = pd.Series(0.01 * np.random.randn(252), index=\
    ...:                   pd.date_range('2011-1-1', periods=252))
    ...: 
    ...: index_1 = pd.to_datetime(['2011-2-2', '2011-4-3', '2011-5-1',])
    ...: index_2 = pd.to_datetime(['2011-2-15', '2011-4-16', '2011-5-17',])
    ...: 
    ...: index_tot = list(zip(index_1,index_2))
    ...: aux_learning_std = pd.DataFrame([np.nanstd(asd_1.loc[i:j]) for i, j in \
    ...:                                                index_tot], index=index_1)
    ...: 

In [18]: starts = asd_1.index.searchsorted(index_1)
    ...: ends = asd_1.index.searchsorted(index_2)
    ...: df = pd.DataFrame(ranged_std(asd_1.values,starts,ends+1),index=index_1)
    ...: 

In [19]: aux_learning_std
Out[19]: 
                   0
2011-02-02  0.007244
2011-04-03  0.012862
2011-05-01  0.010155

In [20]: df
Out[20]: 
                   0
2011-02-02  0.007244
2011-04-03  0.012862
2011-05-01  0.010155

@Asher11 Was complicated, but good challenge! Please note that this has good setup cost, so it would only make sense when dealing with a a good number of ranges. Also, if you are looking to ignore `NaNs` with that `np.nanstd`, we gotta use `np.isnan` and remove those in the implementation of `ranged_std`. Good luck! — Divakar, Aug 16 '16 at 11:21
@Asher11 Well I was wrong! The speedups are great across the entire range, even with all that setup and even for a very small number of ranges! Check out the edits. — Divakar, Aug 16 '16 at 12:07
I noticed! it's really a great boost. If I may be so bold, colud you also provide a snippet to use to plug-in in the right spot for general functions? normally I would do it myself but your expertise of vectorized topics is way out of my depth (as I strongly think the rest of the python realm for that matter) and I struggle to addomesticate the code you wrote for the standard deviation case without losing all the benefits you provided (granted, general functions will lose the extreme speed of your implementation) — Asher11, Aug 16 '16 at 12:21
@Asher11 I would say that you could use upto the first three lines (until `slice_arr = arr[idx]`) of `ranged_std` as it is for any function. After that point, whether a generic func could be ported in such a vectorized fashion or not would depend on the generic func itself. For example, in this case, we have used `np.bincount` to perform the sum/mean calculations as done internally for std. deviation calculations, which might not be applicable for any other generic func. So, after those three lines, either you gotta think more or ask here on SO! Good luck :) — Divakar, Aug 16 '16 at 12:23
@Asher11 Would like to add that you can use `binned_statistic` to use some in-built funcs and also use a custom one using ID array, which we generated in this case as `id_arr` and we need to use the data would be `sliced_arr`. See a short discussion here : http://stackoverflow.com/a/34618404/3293881 — Divakar, Aug 16 '16 at 12:59
Looking at this amazing answer I have a quick question @Divakar do you have any good resources for understanding vectorization at a more comprehensive level? — SerialDev, Aug 16 '16 at 13:00
@SerialDev Hmm, good question and I wish I could point you at something useful. Most of my vectorizing skills I have gathered, I have done it right here on SO by answering vectorization tagged questions and keeping a mindset of not using loops, even though those look straight-forward and quick to implement most of the times. I did find one useful blog that I could point out that relates to NumPy vectorization [`here`](http://ipython-books.github.io/featured-01/). Other than that, I would say just keep practicing and following relevant tagged questions on SO! :) — Divakar, Aug 16 '16 at 13:17
@Divakar, could using strides from this answer help and how would it compare. http://stackoverflow.com/questions/4923617/efficient-numpy-2d-array-construction-from-1d-array — Merlin, Aug 16 '16 at 15:05
@Merlin Yeah, that's one thing I haven't really explored in NumPy. I have seen people use it to optimize stuffs in NumPy. Will let you know if I find something going with it! — Divakar, Aug 16 '16 at 19:35
@Divakar, thanks.... use strides for moving avg windows, not much else. — Merlin, Aug 16 '16 at 20:59
@Merlin Yeah exactly, that's what I figured about strides based on a quick glance. These binned operations seem to be a different ball game. — Divakar, Aug 16 '16 at 21:01

vectorize complex slicing with pandas dataframe

1 Answers1

Vectorized standard deviation across ranges in an array

Linked