You should never append to a DataFrame inside a loop; DataFrame.append is not in-place, so every call builds a brand-new DataFrame, which is very slow. That is most likely your bottleneck.
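If you do need to build a DataFrame piece by piece, the usual fix is to collect the pieces in a plain Python list and concatenate once at the end. A minimal sketch (the pieces here are dummies for illustration):
import pandas as pd

# dummy pieces standing in for whatever a loop produces
pieces = [pd.DataFrame({"a": [i, i]}) for i in range(3)]
# one concat at the end instead of one append per iteration
result = pd.concat(pieces, ignore_index=True)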
One way to avoid that here is to iterate over the old group sizes, iteratively determine the new group label of each row according to the constraint, and then pass those labels to DataFrame.groupby to form the new groups. This method should be a lot faster for large DataFrames.
import pandas as pd

a = [1, 1, 2, 3, 2, 2, 3] * 10
df = pd.DataFrame({"a": a})

# label each run of consecutive identical values (the old groups)
group = (df['a'] != df['a'].shift()).cumsum()
old_group_sizes = df.groupby(group).size().tolist()
# new group labels to pass to groupby (one for each row)
new_labels = []
# current new group label (an integer)
group_label = 0
# running size of the current new group
group_size = 0
# iterate over the old groups' sizes
for s in old_group_sizes:
    # add the old group to the current new group
    group_size += s
    if group_size > 3:    # if it exceeds the threshold
        group_label += 1  # start a new group
        group_size = s    # whose size starts at the old group's size
    # set the new group label for each row of the old group
    new_labels.extend([group_label] * s)
# create the list of new groups/DataFrames
dfs = [g for k, g in df.groupby(new_labels)]
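Note that DataFrame.groupby accepts any array-like of the same length as the DataFrame as keys, so the labels never need to be stored as a column first. If you only need per-group aggregates rather than the individual sub-DataFrames, you can also skip the list entirely and aggregate directly (the sum here is just a placeholder):
sums = df.groupby(new_labels)['a'].sum()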
Output
# number of new groups
>>> len(dfs)
25
# first group
>>> dfs[0]
a
0 1
1 1
2 2
# second group
>>> dfs[1]
a
3 3
4 2
5 2
# group label of each row
>>> df['group num'] = df.groupby(new_labels).ngroup()
>>> df.head(20)
a group num
0 1 0
1 1 0
2 2 0
3 3 1
4 2 1
5 2 1
6 3 2
7 1 2
8 1 2
9 2 3
10 3 3
11 2 4
12 2 4
13 3 4
14 1 5
15 1 5
16 2 5
17 3 6
18 2 6
19 2 6
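A quick way to verify that the constraint holds (it can only hold when no single run of identical values exceeds the threshold, since runs are never split):
>>> all(len(g) <= 3 for g in dfs)
True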
Performance Comparison
I compared your solution with mine for a DataFrame of 300k rows.
Setup
Here is the setup used.
import pandas as pd
import numpy as np

# my solution: build a flat list of labels, then group once
def create_new_labels(df):
    group = (df['a'] != df['a'].shift()).cumsum()
    old_group_sizes = df.groupby(group).size().tolist()
    new_labels = []
    group_label = 0
    group_size = 0
    for s in old_group_sizes:
        group_size += s
        if group_size > 3:
            group_label += 1
            group_size = s
        new_labels.extend([group_label] * s)
    dfs = [g for k, g in df.groupby(new_labels)]
    return dfs

# OP solution: append to the last DataFrame inside the loop
def append_rows(df):
    group = (df['a'] != df['a'].shift()).cumsum()
    dfs = [pd.DataFrame()]
    for i, j in df.groupby(group):
        curRow = j.shape[0]
        prevRow = dfs[-1].shape[0]
        if curRow + prevRow <= 3:
            # DataFrame.append was removed in pandas 2.0;
            # running this requires an older pandas version
            dfs[-1] = dfs[-1].append(j)
        else:
            dfs.append(j)
    return dfs
n = 300_000
df = pd.DataFrame({'a': np.random.randint(1, 4, size=n)})
>>> df.shape
(300000, 1)
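Before benchmarking, a quick check that both functions produce the same grouping (this assumes the very first run is no longer than 3 rows; in that corner case append_rows keeps an empty leading DataFrame while create_new_labels does not):
assert [len(g) for g in create_new_labels(df)] == [len(g) for g in append_rows(df)]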
Results
Here are the results.
>>> %timeit append_rows(df)
50.1 s ± 996 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit create_new_labels(df)
2.92 s ± 113 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This means that on my machine, my solution is ~17 times faster. The gap comes from DataFrame.append: it is called on almost every iteration and builds a brand-new DataFrame each time, whereas building a flat list of labels and calling groupby a single time avoids all of that per-group overhead.