I'm working with a dataframe with 800k records and want to build a faster way of getting the index values between a sum condition vs. my current working version running df.iterrows(). I have a column that I want to sum up to a value and then reset the sum index starting positions to start the cycle over.
A simple example would be any summed value over 1000.
series = [193, 371, 163, 287, 627, 323, 382, 263, 361, 501, 282, 411, 335, 528, 396, 465, 309, 243, 348, 387, 416, 446, 464, 227, 301]
series = pd.Series(series)
target_value = 1000
start = 0
end = 0
sum_values = series[start:end+1].sum()
index_values = []
for indx, val in series.iteritems():
if sum_values >= target_value:
index_values.append([start, end])
sum_values = series[end+1]
start = end
end += 1
else:
sum_values = series[start:end+1].sum()
end += 1
index_values:
[[0, 4], [4, 7], [7, 10], [10, 13], [13, 16], [16, 20], [20, 23]]
I cannot figure out how I can do this with numpy where function with a reset once a value has been surpassed.
Any help in a direction would be much appreciated.