
I have a DataFrame as follows:

a   b   c
x   2   3
y   2   3
z   3   2
w   1   5
(up to thousands of records)

I want to group this DataFrame on b and c so that each group has at most n rows. If a group has more than n rows, I want the extra rows to go into a new group. That is the main problem statement. I also want to delete these groups from the original DataFrame if possible.

Sample output (with a little more explanation):

I basically want to loop over the df and am currently using the following code:

for x,y in df.groupby(['b','c']):
    print(y)
With this code I'm getting the following groups:
a   b   c
x   2   3
y   2   3

a   b   c
z   3   2

a   b   c
w   1   5
Now let's say I want only 1 row (n = 1) in each group; this is the output I'm looking for:
a   b   c
x   2   3

a   b   c
y   2   3

a   b   c
z   3   2

a   b   c
w   1   5

(And maybe delete these groups from the df too if possible)

Thank you!

Pallav Doshi

1 Answer


I have adapted the code from the accepted answer here to your question:

import pandas as pd

df = pd.DataFrame({"a": ["x", "y", "z", "w"],
                   "b": [2, 2, 3, 1],
                   "c": [3, 3, 2, 5]})

n = 1

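# Iterate over each (b, c) group; x is the group key, y is the sub-DataFrame.
for x, y in df.groupby(['b','c']):
    # Slice the group positionally into consecutive pieces of at most n rows.
    list_df = [y[i: i+n] for i in range(0, y.shape[0], n)]
    for i in list_df:
        print(i)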

#a  b  c
#w  1  5
#
#a  b  c
#x  2  3
#
#a  b  c
#y  2  3
#
#a  b  c
#z  3  2

This splits each grouped DataFrame into pieces of at most n rows. To delete each piece from the original DataFrame as you go, add df = df.drop(i.index); this removes the rows by their index labels, which each piece keeps from the original DataFrame:

for x,y in df.groupby(['b','c']):
    list_df = [y[i: i+n] for i in range(0, y.shape[0], n)]
    for i in list_df:
        print(i)
        df = df.drop(i.index)
        print(df)
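
As a possible alternative to slicing inside the loop, the chunking can be expressed with groupby alone. This is only a sketch under the same sample data: the helper column name chunk and the variable names labelled and chunks are made up for illustration. It numbers the rows within each (b, c) group with cumcount(), integer-divides that position by n, and then groups on b, c and the resulting chunk number.

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "y", "z", "w"],
                   "b": [2, 2, 3, 1],
                   "c": [3, 3, 2, 5]})

n = 1

# cumcount() gives each row's 0-based position inside its (b, c) group;
# integer division by n turns that position into a chunk number.
# "chunk" is a hypothetical helper column, not part of the original data.
labelled = df.assign(chunk=df.groupby(["b", "c"]).cumcount() // n)

# Grouping on b, c and the chunk number yields pieces of at most n rows.
chunks = [piece.drop(columns="chunk")
          for _, piece in labelled.groupby(["b", "c", "chunk"])]

for piece in chunks:
    print(piece)
```

Each piece keeps the original index labels, so it can still be passed to df.drop(piece.index) if needed.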
Rawson
  • Thanks! This code does work, but is it suggested to drop rows from a df while looping on it? – Pallav Doshi Jul 04 '22 at 12:07
  • No problem. I would avoid dropping values in a loop if using `.iloc[]`, because positions shift after each drop and the results might be different from what you are expecting. If you are dropping by specific, fixed index labels, then it isn't a problem. – Rawson Jul 07 '22 at 20:14
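
On the question raised in the comments: one way to sidestep dropping inside the loop is to collect the pieces first and drop all their index labels in a single call afterwards. The following is a minimal sketch under that assumption, using the sample data from the answer; the names chunks, used and remaining are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "y", "z", "w"],
                   "b": [2, 2, 3, 1],
                   "c": [3, 3, 2, 5]})

n = 1

# Collect every piece of at most n rows without touching df in the loop.
chunks = []
for _, group in df.groupby(["b", "c"]):
    chunks.extend([group.iloc[i:i + n] for i in range(0, len(group), n)])

# Index labels of all rows that were placed in some piece.
used = [label for piece in chunks for label in piece.index]

# Drop them in one call; df itself is left unchanged.
remaining = df.drop(index=used)

print(len(chunks))
print(remaining)
```

Here every row ends up in some piece, so remaining is empty; if only some pieces were processed, remaining would hold the rows that are left.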