
I'm working with a DataFrame of 800k records and want a faster way of getting the index values at which a running sum crosses a threshold than my current working version, which uses df.iterrows(). I have a column that I want to sum until it reaches a target value, then reset the sum and start the cycle over from that index.

A simple example would be resetting whenever the summed value goes over 1000.

import pandas as pd

series = [193, 371, 163, 287, 627, 323, 382, 263, 361, 501, 282, 411, 335, 528,
          396, 465, 309, 243, 348, 387, 416, 446, 464, 227, 301]
series = pd.Series(series)
target_value = 1000
start = 0
end = 0
sum_values = series[start:end+1].sum()
index_values = []
for indx, val in series.iteritems():
    if sum_values >= target_value:
        index_values.append([start, end])
        sum_values = series[end+1]
        start = end
        end += 1
    else:
        sum_values = series[start:end+1].sum()
        end += 1

index_values:
[[0, 4], [4, 7], [7, 10], [10, 13], [13, 16], [16, 20], [20, 23]]

I cannot figure out how to do this with NumPy's where function in a way that resets once the value has been surpassed.

Any help in a direction would be much appreciated.

BubaSkrimp
  • There isn't a vectorized method to handle this directly; look at `numba` for the iterative solution, which should be just as fast (a sketch of this follows these comments). Please post a [mcve] and review [ask]. – user3483203 Feb 13 '20 at 18:36
  • `s = pd.cut(df.index, [0,26,78,85], include_lowest=True); df.groupby(s).sum()` – Quang Hoang Feb 13 '20 at 18:42
  • https://stackoverflow.com/questions/56904390/restart-cumsum-and-get-index-if-cumsum-more-than-value If that answers it, this should probably be closed as a dup. You're unlikely to attract better answers than that question did. – ALollz Feb 13 '20 at 18:45
  • @ALollz I did not see that post. I will check it out. Thanks – BubaSkrimp Feb 13 '20 at 19:06
  • @ALollz The link you provided above was very helpful. Thank you. – BubaSkrimp Feb 16 '20 at 15:46
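
Following the `numba` suggestion in the comments, here is a minimal sketch of that iterative solution (the function name `find_intervals`, the preallocated output array, and the float conversion are illustrative assumptions, not a benchmarked implementation); it reproduces the `[start, end)` pairs from the question:

import numpy as np
import pandas as pd
from numba import njit

@njit
def find_intervals(values, target):
    # One pass with a running sum: record a [start, end) pair and reset
    # the running total whenever it reaches the target.
    out = np.empty((len(values), 2), dtype=np.int64)
    count = 0
    start = 0
    running = 0.0
    for i in range(len(values)):
        running += values[i]
        if running >= target:
            out[count, 0] = start
            out[count, 1] = i + 1
            count += 1
            start = i + 1
            running = 0.0
    return out[:count]

series = pd.Series([193, 371, 163, 287, 627, 323, 382, 263, 361, 501, 282, 411, 335,
                    528, 396, 465, 309, 243, 348, 387, 416, 446, 464, 227, 301])
intervals = find_intervals(series.to_numpy(np.float64), 1000)
# same pairs as the question's loop: [0, 4], [4, 7], [7, 10], ..., [20, 23]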

1 Answer


The only thing I can see you doing here is accelerating this process by making a better guess at the size of your interval and by using some multiprocessing.

You can use np.split to make a good initial guess, like:

import numpy as np

interval_size = 10
values = np.random.rand(1000) * 10
# split into equal chunks and sum element-wise across them,
# so each result is the sum of `interval_size` random values
np.array(np.split(values, interval_size)).sum(axis=0)

out:

array([4630.79410889, 5251.52550577, 4810.02960764, 5215.12946275,
       4962.31855639, 4725.5437016 , 4710.573337  , 4990.37905624,
       5324.98906685, 5105.86335981])

and iterate until you find an optimal initial value, like:

interval = []
cost = []
for i in range(10, 100):
    try:
        values = np.random.rand(1000) * 10
        sums = np.array(np.split(values, i)).sum(axis=0)  # each is a sum of i values
        bigger = sums > 100
        smaller = sums < 200
        # fraction of sums that land inside the (100, 200) band
        cost.append(sum(bigger * smaller) / len(smaller))
        interval.append(i)
    except ValueError:
        # np.split only works when 1000 is divisible by i
        pass

cost
Out[73]: [0.0, 0.5, 1.0, 0.32, 0.0]

interval
Out[74]: [10, 20, 25, 40, 50]

In this case, 25 is the best initial guess for the interval size.
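
One way to turn such an initial guess into the exact cutoffs the question needs (not part of the answer above; the helper name `exact_cutoffs`, the cumulative-sum bookkeeping, and the guess of 3 values per interval for the question's `series` are assumptions, sketched rather than benchmarked) is to jump ahead by the guessed interval size and then step to the precise index:

import numpy as np

def exact_cutoffs(values, target, guess):
    # csum[j] holds the sum of values[:j], so any slice sum is a single subtraction
    csum = np.concatenate(([0], np.cumsum(values)))
    cuts = []
    start = 0
    n = len(values)
    while start < n:
        end = min(start + guess, n)  # jump ahead by the guessed interval size
        while end < n and csum[end] - csum[start] < target:
            end += 1                 # extend until the target is reached
        while end - 1 > start and csum[end - 1] - csum[start] >= target:
            end -= 1                 # trim back to the first index that reaches it
        if csum[end] - csum[start] < target:
            break                    # the remaining tail never reaches the target
        cuts.append([start, end])
        start = end
    return cuts

exact_cutoffs(series.to_numpy(), 1000, 3)
# [[0, 4], [4, 7], [7, 10], [10, 13], [13, 16], [16, 20], [20, 23]]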

Guinther Kovalski
  • For my purposes it is important to have the exact cutoff. I think you have an interesting concept here. I ended up following the link provided by @ALollz and rebuilt the functions using numpy vs. pandas, with acceptable times now. – BubaSkrimp Feb 16 '20 at 16:45