How to split a Pandas DataFrame into multiple DataFrame by the sum of a certain value?

Question

I have a DataFrame that looks something like this:

    Name      Count
0   John      0.25
1   Adam      0.75
2   Michael   1.5
3   Jane      0.8
4   Anna      2.0
5   Sarah     0.25

My objective is to split this DataFrame into multiple DataFrames based on a limit value. For this example, the limit value will be 3 - so the result would look like:

    Name      Count
0   John      0.25
1   Adam      0.75
2   Michael   1.5


   Name      Count
0  Jane      0.8
1  Anna      2.0

   Name      Count
0  Sarah     0.25

The point is that in each of the new DataFrames, the sum of the counts approaches the limit value, but does not exceed it (i.e. Jane is in the second DataFrame, because if she was included in the first, the sum would be 3.3, which exceeds the limit value of 3).

I believe this is possible with .iterrows, but that's decently heavy/slow, so looking for another solution.

Can you post a reproducible example, so it will be easier to work with ? — UGuntupalli, Jul 22 '20 at 20:38

ansev · Answer 1 · 2020-07-22T21:12:20.457

0

We could do,

LIMIT = 3
assert df['Count'].le(LIMIT).all()

groups = []
sum = 0
group = 0

for val in df['Count']:
    sum += val
    if sum > LIMIT:
        group += 1
        sum = val
    groups.append(group)
    
my_dict = {f'Group {i}' : group.reset_index(drop=True) 
           for i, group in df.groupby(groups)}
#print(my_dict['Group 0'])

edited Jul 22 '20 at 21:12

answered Jul 22 '20 at 20:52

ansev

30,322
5
17
31

score 0 · Answer 2 · answered Jul 22 '20 at 21:39

For speed up, we can use numba which is just in time compiliation (JIT) and will be very efficient:

Function adjusted from another answer of mine

from numba import njit
import numpy as np

@njit
def cumsum_reset(array, limit):
    total = 0
    counter = 0
    groups = np.empty(array.shape[0])
    
    for idx, i in enumerate(array):
        total += i
        if total >= limit:
            total = 0
            total += i
            counter += 1
            groups[idx] = counter
        else:
            groups[idx] = counter

    return groups

groups = cumsum_reset(df['Count'].to_numpy(), 3)

for _, grp in df.groupby(groups):
    print(grp)

      Name  Count
0     John   0.25
1     Adam   0.75
2  Michael   1.50
   Name  Count
3  Jane    0.8
4  Anna    2.0
    Name  Count
5  Sarah   0.25

How to split a Pandas DataFrame into multiple DataFrame by the sum of a certain value?

2 Answers2