New answer
The OP had left out a critical detail: rows must be processed in order, and a row is skipped only when adding its value would push the running total past the maximum. This changes the question substantially, and the original answer below no longer applies.
In fact, there is no vectorized operation in current pandas or numpy that can compute such a cumulative sum with data-dependent skips at positions unknown beforehand. As indicated in this SO answer, the best remaining approach for large arrays is to use numba.
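To see why, consider a naive vectorized attempt (a minimal sketch, using the example DataFrame from the next section): a plain cumsum has no way of knowing that a skipped row must be excluded from every later comparison.
import pandas as pd

df = pd.DataFrame({'A': [2, 5, 9, 4, 3], 'B': [5, 4, 3, 3, 1]})

# Naive mask: keep rows while the running total stays within 15.
# Row 2 (A=9, cumsum 16) is correctly dropped, but rows 3 and 4 are
# dropped too, because their cumsum still includes the skipped 9.
naive = df['A'].cumsum() <= 15
print(naive.tolist())  # [True, True, False, False, False]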
Here is how to do this, adapted for this problem:
import numpy as np
import pandas as pd

from numba import njit
from numba.types import bool_


@njit
def conditional_cumsum(x, skip_if_higher_than):
    # Greedy single pass: keep a row only if adding its value
    # leaves the running total at or below the threshold.
    total = 0
    result = np.zeros_like(x, dtype=bool_)
    for i, y in enumerate(x):
        if total + y <= skip_if_higher_than:
            total += y
            result[i] = True
    return result
Examples
df = pd.DataFrame({'A': [2, 5, 9, 4, 3], 'B': [5, 4, 3, 3, 1]})
mask = conditional_cumsum(df['A'].values, skip_if_higher_than=15)
>>> df.loc[mask]
A B
0 2 5
1 5 4
3 4 3
4 3 1
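Row 2 (A=9) is skipped because 2 + 5 + 9 = 16 would exceed 15; rows 3 and 4 still fit afterwards, bringing the total to 2 + 5 + 4 + 3 = 14.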
df = pd.DataFrame({'A': [5, 5, 5, 1], 'B': [5, 4, 3, 1]})
>>> conditional_cumsum(df['A'].values, skip_if_higher_than=15)
array([ True, True, True, False])
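Here the first three rows bring the total to exactly 15 (the threshold is inclusive), so the last row, which would push it to 16, is skipped.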
Speed
n = 1_000_000
np.random.seed(0)
df = pd.DataFrame({
'A': np.random.uniform(size=n),
'B': np.arange(n)
})
%timeit conditional_cumsum(df['A'].values, skip_if_higher_than=1)
460 µs ± 751 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit conditional_cumsum(df['A'].values, skip_if_higher_than=100)
460 µs ± 939 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
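Note that the runtime is essentially independent of the threshold: the function always makes a single O(n) pass over the array.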
>>> df.loc[conditional_cumsum(df['A'].values, skip_if_higher_than=1)]
A B
0 5.488135e-01 0
4 4.236548e-01 4
16 2.021840e-02 16
99 4.695476e-03 99
757 1.383350e-03 757
821 5.459649e-04 821
1070 6.642186e-04 1070
84341 3.310554e-06 84341
131245 1.989694e-05 131245
661553 7.071203e-07 661553
Please note: there is an initial "compile time" for any njit function. Run it first on a small array to trigger compilation, then on the big arrays.
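For instance (a minimal sketch), a throwaway call before timing ensures that %timeit measures the compiled code rather than the compilation itself:
# First call triggers numba's JIT compilation; later calls reuse it.
_ = conditional_cumsum(np.array([1.0, 2.0, 3.0]), skip_if_higher_than=10)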
Original answer
Since you can skip over rows, the order in which rows are selected is unimportant (we can recover the original order afterward). The maximum subset is, as @MuhammadHassan said, the sorted values of A that cumulatively sum to at most 15:
# Boolean mask over the sorted values: True while the running total <= 15
s = df['A'].sort_values().cumsum() <= 15
# Intersect with df.index to recover the original row order
idx = df.index.intersection(s[s].index)
>>> idx.tolist()
[0, 1, 3, 4]
# and
>>> df.loc[idx]
A B
0 2 5
1 5 4
3 4 3
4 3 1
Note on the original answer
I'll leave this original answer up for didactic purposes, but @MuhammadHassan's answer is correct and more concise. To prevent the UserWarning: Boolean Series key will be reindexed to match DataFrame index (and to select up to 15, meaning up to and including 15):
>>> df.loc[df['A'].sort_values().cumsum() <= 15]
A B
0 2 5
1 5 4
3 4 3
4 3 1
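(Passing the mask through .loc works because a boolean Series indexer is aligned on the index, so the sorted order of the mask does not matter; df[mask] with a differently-ordered index is what triggers the warning.)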