
I want to split a DataFrame into chunks whenever the cumulative sum of column B reaches a threshold (here 50):

print(df)

   A   B
0  0  10
1  1  30
2  2  50
3  3  20
4  4  10
5  5  30


Desired output:

   A   B
0  0  10
1  1  30

   A   B
2  2  50

   A   B
3  3  20
4  4  10
5  5  30
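For a self-contained reproduction, the example frame above can be built like this (the values are taken from the outputs shown in the answers):

```python
import pandas as pd

# rebuild the example frame: A is the row number, B holds the values to accumulate
df = pd.DataFrame({'A': range(6), 'B': [10, 30, 50, 20, 10, 30]})
print(df)
```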
maher

2 Answers


You could use pd.cut on the cumulative sum of the B column:

th = 50

# find the cumulative sum of B 
cumsum = df.B.cumsum()

# create the bins with spacing of th (threshold)
bins = list(range(0, cumsum.max() + 1, th))

# group by (split by) the bins
groups = pd.cut(cumsum, bins)

for key, group in df.groupby(groups):
    print(group)
    print()

Output

   A   B
0  0  10
1  1  30

   A   B
2  2  50

   A   B
3  3  20
4  4  10
5  5  30
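As a side note (not part of the original answer), the same partition can be obtained without building an explicit bin list, by integer-dividing the cumulative sum. A minimal sketch under the same assumptions:

```python
import pandas as pd

df = pd.DataFrame({'A': range(6), 'B': [10, 30, 50, 20, 10, 30]})
th = 50

# (cumsum - 1) // th gives the zero-based bin index of each row's cumulative sum
groups = df.B.cumsum().sub(1).floordiv(th)
print(groups.tolist())  # [0, 0, 1, 2, 2, 2]
```

These integer labels split `df.groupby(groups)` into the same three chunks as the `pd.cut` version.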
Dani Mesejo

Here's a method using numba to speed up the for loop. We check when the limit is reached, then reset the running total and assign a new group:

import numpy as np
from numba import njit

@njit
def cumsum_reset(array, limit):
    total = 0
    counter = 0
    # pre-allocated output array of group labels
    groups = np.empty(array.shape[0], dtype=np.int64)
    for idx, i in enumerate(array):
        total += i
        # start a new group once the running total reaches the limit,
        # or right after an element that hit the limit exactly
        if total >= limit or array[idx - 1] == limit:
            counter += 1
            groups[idx] = counter
            total = 0
        else:
            groups[idx] = counter

    return groups

grps = cumsum_reset(df['B'].to_numpy(), 50)

for _, grp in df.groupby(grps):
    print(grp, '\n')

Output

   A   B
0  0  10
1  1  30 

   A   B
2  2  50 

   A   B
3  3  20
4  4  10
5  5  30

Timings:

# create dataframe of 600k rows
dfbig = pd.concat([df]*100000, ignore_index=True)
dfbig.shape

(600000, 2)

# Erfan
%%timeit
cumsum_reset(dfbig['B'].to_numpy(), 50)

4.25 ms ± 46.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Daniel Mesejo
def daniel_mesejo(th, column):
    cumsum = column.cumsum()
    bins = list(range(0, cumsum.max() + 1, th))
    groups = pd.cut(cumsum, bins)
    
    return groups

%%timeit
daniel_mesejo(50, dfbig['B'])

10.3 s ± 2.17 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Conclusion: the numba for loop is roughly 2400x faster (10.3 s vs 4.25 ms).
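A possible middle ground between the two approaches (not from either answer, just a sketch) is `np.searchsorted` on the cumulative sum, which produces the same partition with plain integer labels instead of Interval objects:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(6), 'B': [10, 30, 50, 20, 10, 30]})
th = 50

cumsum = df['B'].to_numpy().cumsum()
bins = np.arange(0, cumsum.max() + 1, th)

# index of the first bin edge that is >= each cumulative sum
groups = np.searchsorted(bins, cumsum, side='left')
print(groups.tolist())  # [1, 1, 2, 3, 3, 3]
```

Grouping `df` by these labels yields the same three chunks as both answers.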

Erfan
  • You can get a large speedup in your Numba function if you use a numpy array for `groups` instead of a list. – max9111 Oct 28 '19 at 16:42
  • I tried that, but got an error that `type of variable cannot be determined`. @max9111 – Erfan Oct 28 '19 at 16:58
  • It should be possible with allocating a array with `groups = np.empty(array.shape[0],dtype=np.uint64)` instead of `groups = []` and writing the result to the array using `groups[idx]=counter` instead of `groups.append(counter)`. – max9111 Oct 28 '19 at 17:02
  • I see, that worked indeed, will edit answer. I tried with `groups=np.array([])` and then `groups = np.append(groups, counter)`. That gave me an error. @max9111 – Erfan Oct 28 '19 at 17:07
  • Hi @Erfan, thanks for the answer, but I always need to split the data into 6 bins based on the threshold. Is that possible? I tried to edit your code but it does not work: if total >= 50 or array[idx-1] == 50 and goups==3: – maher Nov 12 '19 at 09:38
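For the fixed-number-of-groups variant asked about in the last comment, one hedged sketch (an assumption on my part, not a tested answer from the thread) is to pass an integer bin count to pd.cut, which splits the cumulative sum into exactly that many equal-width ranges:

```python
import pandas as pd

df = pd.DataFrame({'A': range(6), 'B': [10, 30, 50, 20, 10, 30]})

# an integer `bins` argument asks pd.cut for that many equal-width bins
groups = pd.cut(df.B.cumsum(), bins=6)
print(groups.cat.categories.size)  # 6
```

Note that with equal-width bins some of the 6 groups may be empty for skewed data.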