
Suppose I have a dataframe like so (already sorted on column B):

   A   B
0  2   5   
1  5   4
2  9   3
3  4   3
4  3   1

Now I have to find the subset of rows whose sum over column A is at most 15, but the order of the rows in the dataframe cannot change (it must remain sorted by column B values); a row may be omitted while calculating the sum. In this case rows (0, 1, 3, 4) give a sum over column A of 14. Basically, a conditional summation over a column.

Expected output: the dataframe restricted to rows whose column A values sum to <= N (15 here).

NOTE:

I can skip only those rows whose value of A exceeds total_left (total - sum_so_far); every other row must be considered, in that order (prioritized by B). There can be millions of rows in the dataframe, and I must definitely include every row whose value of A is < total_left; I cannot leave those rows out. So, for example, if I have something like this:

   A  B
0  5  5
1  5  4
2  5  3
3  1  1

I cannot take row 3 first just because it has the smallest value; I need to take row 0, row 1, row 2 first, therefore sorting on column A isn't possible.
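The selection rule above can be sketched as a plain-Python greedy loop (a reference implementation for clarity, not speed; the function name `greedy_select` is mine, not from the question):

```python
import pandas as pd

def greedy_select(df, total, col='A'):
    """Walk rows in their current (B-sorted) order; take a row whenever
    adding its `col` value keeps the running sum within `total`."""
    running = 0
    keep = []
    for idx, value in df[col].items():
        if running + value <= total:
            running += value
            keep.append(idx)
    return df.loc[keep]

df = pd.DataFrame({'A': [2, 5, 9, 4, 3], 'B': [5, 4, 3, 3, 1]})
print(greedy_select(df, 15).index.tolist())  # [0, 1, 3, 4], sum over A = 14
```

Because each row's inclusion depends on the running total of previously kept rows, the loop is inherently sequential, which is what makes the problem hard to vectorize.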

Himanshu Jotwani

2 Answers


New answer

The OP had left out a critical detail: a row in order cannot be skipped if adding its value keeps the cumulative sum within the maximum. This changes the question substantially, and the original answer below no longer applies.

In fact, there is no vectorized operation in current pandas or numpy that achieves such a conditional cumulative sum, where the decision to include each element depends on the running total so far.
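To see why a single vectorized pass is not enough: a plain cumsum-based mask decides each row from the unconditional running total, so once one row should be skipped, every later comparison is wrong. A small illustration:

```python
import numpy as np

a = np.array([2, 5, 9, 4, 3])

# Naive attempt: mask rows where the plain cumulative sum stays <= 15.
naive = np.cumsum(a) <= 15
# cumsum = [2, 7, 16, 20, 23] -> only the first two rows pass,
# even though skipping the 9 would let 4 and 3 fit as well.
print(naive)  # [ True  True False False False]
```

The correct mask for this input is `[True, True, False, True, True]`, which no fixed combination of `cumsum` and comparisons produces, hence the explicit loop below.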

As indicated in this SO answer, the best approach that's left for large arrays is to use numba.

Here is how to do this, adapted for this problem:

import numpy as np
import pandas as pd
from numba import njit
from numba.types import bool_

@njit
def conditional_cumsum(x, skip_if_higher_than):
    # Greedy single pass: keep an element iff adding it stays within the cap.
    total = 0
    result = np.zeros_like(x, dtype=bool_)
    for i, y in enumerate(x):
        if total + y <= skip_if_higher_than:
            total += y
            result[i] = True
    return result

Examples

df = pd.DataFrame({'A': [2, 5, 9, 4, 3], 'B': [5, 4, 3, 3, 1]})
mask = conditional_cumsum(df['A'].values, skip_if_higher_than=15)

>>> df.loc[mask]
   A  B
0  2  5
1  5  4
3  4  3
4  3  1


df = pd.DataFrame({'A': [5, 5, 5, 1], 'B': [5, 4, 3, 1]})

>>> conditional_cumsum(df['A'].values, skip_if_higher_than=15)
array([ True,  True,  True, False])

Speed

n = 1_000_000
np.random.seed(0)
df = pd.DataFrame({
    'A': np.random.uniform(size=n),
    'B': np.arange(n)
})

%timeit conditional_cumsum(df['A'].values, skip_if_higher_than=1)
460 µs ± 751 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit conditional_cumsum(df['A'].values, skip_if_higher_than=100)
460 µs ± 939 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> df.loc[conditional_cumsum(df['A'].values, skip_if_higher_than=1)]
                   A       B
0       5.488135e-01       0
4       4.236548e-01       4
16      2.021840e-02      16
99      4.695476e-03      99
757     1.383350e-03     757
821     5.459649e-04     821
1070    6.642186e-04    1070
84341   3.310554e-06   84341
131245  1.989694e-05  131245
661553  7.071203e-07  661553

Please note: every njit function incurs a one-time compilation cost on its first call. Run it first on a small array to trigger compilation, then use it on the big arrays.

Original answer

Since you can skip over rows, the order in which rows are selected is unimportant (we can recover it afterward). The maximal subset will be, as @MuhammadHassan said, the portion of A, sorted, whose cumulative sum stays within 15:

s = df['A'].sort_values().cumsum() <= 15
idx = df.index.intersection(s[s].index)

>>> idx.tolist()
[0, 1, 3, 4]

# and
>>> df.loc[idx]
   A  B
0  2  5
1  5  4
3  4  3
4  3  1

Note on the original answer

I'll leave this original answer for didactic purposes, but @MuhammadHassan's answer is correct and more concise. To prevent the UserWarning: Boolean Series key will be reindexed to match DataFrame index (and to select up to 15, where "up to" means up to and including):

>>> df.loc[df['A'].sort_values().cumsum() <= 15]
   A  B
0  2  5
1  5  4
3  4  3
4  3  1
Pierre D

IIUC:

df = df[df['A'].sort_values().cumsum() < 15]

OUTPUT

   A  B
0  2  5
1  5  4
3  4  3
4  3  1
Muhammad Hassan
  • you have understood it partially correctly, but the problem is that I can't sort based on column A, since the priority (sorting) of column B needs to be there (a prerequisite); if I sort based on column A, the priority (sorting) of column B gets disturbed. – Himanshu Jotwani Dec 26 '21 at 17:31
  • this is better than my approach: more concise. However you need to use `df.loc[...]` in order to avoid the `UserWarning`. Also, "up to" means "up to and including". – Pierre D Dec 26 '21 at 18:02