
I have a dataframe that I have grouped on a column. I then need to merge these "grouped dataframes" into new dataframes, under the condition that each new dataframe must not have more than x rows (3 in this case); if adding a group would exceed that count, I create a new df (the else part in my code). I have code that does this, but it is slow on my actual dataset of 300,000 rows.

Code

import pandas as pd

a = [1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3]
df = pd.DataFrame({"a": a})

dfs = [pd.DataFrame()]

group = (df['a'] != df['a'].shift()).cumsum()

for i, j in df.groupby(group):
    curRow = j.shape[0]
    prevRow = dfs[-1].shape[0]
    # would adding this group exceed 3 rows?
    if curRow + prevRow <= 3:
        # 3 rows or fewer in total, so append to the previous df
        dfs[-1] = dfs[-1].append(j)
    else:
        # more than 3 rows, so start a new df
        dfs.append(j)

Expected Output

"dfs" will have 25 dataframes but faster than current code

Current Output

"dfs" will have 25 dataframes

For those asking about the logic of the group by

It is basically what itertools.groupby does when you give it a sequence that is not sorted:

Make Pandas groupby act similarly to itertools groupby
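A minimal sketch of what I mean, on a small slice of my data:

import itertools
import pandas as pd

sample = [1, 1, 2, 3, 2, 2, 3]

# itertools.groupby splits an unsorted sequence on every change of value
runs = [list(g) for _, g in itertools.groupby(sample)]
# runs -> [[1, 1], [2], [3], [2, 2], [3]]

# the shift/cumsum trick labels those same runs in pandas
s = pd.Series(sample)
group = (s != s.shift()).cumsum()
# group -> 1, 1, 2, 3, 4, 4, 5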

Jake
  • can you explain the logic of the split? – mozway Nov 18 '21 at 14:45
  • @mozway the logic is basically what `itertools.groupby` does when you give a sequence that is not sorted, https://stackoverflow.com/questions/32683492/make-pandas-groupby-act-similarly-to-itertools-groupby – Jake Nov 18 '21 at 14:46
  • I am talking about the content of the loop, what do you want to do here? – mozway Nov 18 '21 at 14:48
  • @mozway each group in the group by must be joined to a newer dataframe, the condition is that when I join a new group, the number of rows must not cross a certain limit, in this case `curRow + prevRow <= 3:` that is `3`, am I making sense? if the number of rows is more than the count (3) in this case, I will create a new dataframe, my else part – Jake Nov 18 '21 at 14:49
  • what is slow is likely the `append` part, you won't improve the `groupby` much, that's why I am asking about the real end goal – mozway Nov 18 '21 at 14:52
  • @mozway the end goal is I should be ending with 25 "broken down" dataframes formed from the different groups I get, I need to make sure when I merge these groups I do not cross the limit, if I am not clear, then I would gladly edit / provide more info, `dfs` is the end goal – Jake Nov 18 '21 at 14:56
  • @Jake. Your question sounds like a multiple knapsack problem. – Corralien Nov 18 '21 at 15:07
  • @Corralien thanks for mentioning that term, let me see if I can find anything I can make use of – Jake Nov 18 '21 at 15:10
  • I don't understand how it should result in 25 groups. If I understood it correctly, you want to first group by consecutive values of the 'a' column. Then, if a given group contains more than 3 rows you want to split it into two or more groups. Is that correct? However, according to your example `df.groupby(group)` results in 50 groups already, and you want to split them even more. Therefore, the number of final groups can't be lower than 50. – Rodalm Nov 18 '21 at 15:28
  • Nevermind, you want to merge consecutive groups, not split each even more. I misread the question. – Rodalm Nov 18 '21 at 15:32
  • @HarryPlotter yeah, merge consecutive groups into newer dfs with a constraint on the row limit for each new df formed – Jake Nov 18 '21 at 15:37

1 Answer


You should never append to DataFrames, especially in a loop; it's very slow. That is most likely the bottleneck.
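(As a general aside, and only a minimal sketch of the usual pattern rather than the approach taken below: if you do have to build a frame from pieces, collect them in a plain Python list and call pd.concat once at the end.)

import pandas as pd

pieces = [pd.DataFrame({"a": [i, i + 1]}) for i in range(3)]  # toy pieces

# instead of growing a DataFrame with repeated .append() calls,
# which copies all the accumulated data on every iteration,
# concatenate everything in one go:
out = pd.concat(pieces)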

One way to avoid that is to iterate over the old group sizes, and iteratively determine the new group labels of each row according to the constraint. Then pass those labels to DataFrame.groupby to form the new groups.

This method should be a lot faster for large DataFrames.

import pandas as pd

a = [1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3]
df = pd.DataFrame({"a": a})

group = (df['a'] != df['a'].shift()).cumsum()
old_group_sizes = df.groupby(group).size().tolist() 

# new group labels to pass to groupby (one for each row)
new_labels = []
# initialize the new group labels (integers)  
group_label = 0
# current group size 
group_size = 0

# iterate over the old groups' sizes
for s in old_group_sizes:
    # increment the size of the current group
    group_size += s  
    if group_size > 3:     # if it exceeds the threshold 
        group_label += 1   # start a new group 
        group_size = s     # and update the new group size 
    # set the new group label for each row of the old group
    new_labels.extend([group_label]*s)

# create a list of the new groups/ DataFrames
dfs = [g for k, g in df.groupby(new_labels)]

Output

# number of new groups
>>> len(dfs)
25

# first group
>>> dfs[0]

   a
0  1
1  1
2  2

# second group
>>> dfs[1]

   a
3  3
4  2
5  2

# group label of each row 
>>> df['group num'] = df.groupby(new_labels).ngroup()
>>> df.head(20)

    a  group num
0   1          0
1   1          0
2   2          0
3   3          1
4   2          1
5   2          1
6   3          2
7   1          2
8   1          2
9   2          3
10  3          3
11  2          4
12  2          4
13  3          4
14  1          5
15  1          5
16  2          5
17  3          6
18  2          6
19  2          6

Performance Comparison

I compared your solution with mine for a DataFrame of 300k rows.

Setup

Here is the setup used.

import pandas as pd
import numpy as np

# my solution
def create_new_labels(df):
    group = (df['a'] != df['a'].shift()).cumsum()
    old_group_sizes = df.groupby(group).size().tolist() 
    new_labels = []
    group_label = 0
    group_size = 0

    for s in old_group_sizes:
        group_size += s  
        if group_size > 3:    
            group_label += 1   
            group_size = s    
        new_labels.extend([group_label]*s)
            
    dfs = [g for k, g in df.groupby(new_labels)]
    return dfs

# OP solution
def append_rows(df):
    group = (df['a'] != df['a'].shift()).cumsum()
    dfs = [pd.DataFrame()]
    for i, j in df.groupby(group):
        curRow = j.shape[0]
        prevRow = dfs[-1].shape[0]
        if curRow + prevRow <= 3:
            dfs[-1] = dfs[-1].append(j)
        else:
            dfs.append(j)
    return dfs

n = 300_000

df = pd.DataFrame({'a': np.random.randint(1, 4, size=n)})

>>> df.shape

(300000, 1)

Results

Here are the results.

>>> %timeit append_rows(df)

50.1 s ± 996 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit create_new_labels(df)

2.92 s ± 113 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This means that on my machine, my solution is ~ 17 times faster.
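As a quick extra sanity check (just a sketch on my side, assuming the 70-element list a from the question is still defined), rerunning create_new_labels on that small frame confirms that no rows are lost and that the expected 25 groups come out:

small = pd.DataFrame({"a": a})          # the 70-element example from the question

dfs_check = create_new_labels(small)
assert len(dfs_check) == 25                           # expected number of merged groups
assert sum(len(g) for g in dfs_check) == len(small)   # every row lands in exactly one group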

Rodalm
  • thanks for the solution, I will try this on my dataset and let you know, will mark this as accepted if it significantly decreases my time, do not have access to the data right now – Jake Nov 18 '21 at 16:31
  • the solution seems promising so far, so I have upvoted – Jake Nov 18 '21 at 16:32
  • thanks, this works faster than what I have shown – Jake Nov 18 '21 at 16:47
  • @Jake I'm running some tests, I will post them soon. It should be ~10 times faster or more – Rodalm Nov 18 '21 at 16:47
  • can you tell me where you learnt pandas and to do stuff like this? – Jake Nov 18 '21 at 16:50
  • @Jake I updated the answer with a simple performance comparison between the two solutions. I highly recommend reading the official guides of the [docs](https://pandas.pydata.org/pandas-docs/version/1.1.5/), it's the best resource for learning `pandas`. Then it's just a matter of practising a lot! – Rodalm Nov 18 '21 at 17:27
  • I can see similar results in my run of your test too, and thanks for pointing to the docs, hopefully I can do these things on my own – Jake Nov 19 '21 at 01:39