I have a pandas DataFrame that I am grouping by columns ['client', 'product', 'data'].

grouped_data = raw_data.groupby(['client', 'product', 'data'])
print(len(grouped_data))
# 10000

I want to split the resulting groupby object into two chunks, one containing roughly 80% of the groups, the other one containing the rest.

I have been banging my head against the screen for some time now...

Joseph Tura

2 Answers

You can build one key per row and split the unique keys with np.split:

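import numpy as np

# Build one hashable key per row from the grouping columns.
df['key'] = df[['client', 'product', 'data']].apply(tuple, axis=1)

# Split the unique keys so the first chunk holds about 80% of the groups.
keys = df['key'].unique()
g1, g2 = np.split(keys, [int(len(keys) * 0.8)])

df1 = df[df['key'].isin(g1)]
df2 = df[df['key'].isin(g2)]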
BENY
  • This is nice, though I prefer `df.set_index([...]).index` for defining groups. I *think* it's faster than `apply` + `tuple`. – jpp Oct 15 '18 at 15:09
  • @jpp I think so too; maybe using `list` + `map` could improve the speed – BENY Oct 15 '18 at 15:10
  • Your proposed solution works a treat. How would that other version look roughly? – Joseph Tura Oct 16 '18 at 07:22
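For the `set_index` variant mentioned in the comments, a rough sketch might look like the following. The sample rows and the names `cols`, `row_keys`, `unique_keys`, `cut`, and `mask` are assumptions made for illustration, and whether this is actually faster than `apply` + `tuple` is not measured here.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    'client':  ['c1', 'c1', 'c2', 'c2', 'c3', 'c4', 'c5', 'c5'],
    'product': ['a',  'a',  'a',  'b',  'a',  'b',  'a',  'a'],
    'data':    ['x',  'x',  'y',  'y',  'x',  'x',  'y',  'y'],
})

cols = ['client', 'product', 'data']

# One MultiIndex entry per row, built without a row-wise apply.
row_keys = df.set_index(cols).index

# Unique group keys, in order of first appearance.
unique_keys = row_keys.unique()

# The first ~80% of the groups go into the first chunk.
cut = int(len(unique_keys) * 0.8)
mask = row_keys.isin(unique_keys[:cut])

df1 = df[mask]
df2 = df[~mask]
```

Because the mask is built from whole group keys, all rows of a group end up in the same chunk, and no helper column has to be added to `df`.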
You could do something along the lines of:

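import numpy as np

# Grouped by one column to keep the example small; pass a list to group by several.
grouped = df.groupby('Client')

# Index of the first group in the second chunk (3 for the 5 groups below).
bound = int(np.ceil(len(grouped) * 0.8)) - 1

# Materialise the groups once, then slice the list of per-group DataFrames.
groups = [g for _, g in grouped]
chunk1 = groups[:bound]
chunk2 = groups[bound:]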

For the following sample dataframe:

     Client   Product   Data
0   Client1  ProductA  Data1
1   Client2  ProductA  Data3
2   Client3  ProductB  Data1
3   Client4  ProductA  Data2
4   Client5  ProductB  Data1
5   Client2  ProductA  Data1
6   Client3  ProductA  Data3
7   Client2  ProductB  Data1
8   Client3  ProductB  Data1
9   Client5  ProductA  Data2
10  Client1  ProductA  Data1
11  Client1  ProductB  Data1
12  Client4  ProductA  Data2
13  Client3  ProductB  Data2
14  Client2  ProductB  Data3

chunk1 would yield:

     Client   Product   Data
0   Client1  ProductA  Data1
10  Client1  ProductA  Data1
11  Client1  ProductB  Data1

     Client   Product   Data
1   Client2  ProductA  Data3
5   Client2  ProductA  Data1
7   Client2  ProductB  Data1
14  Client2  ProductB  Data3

     Client   Product   Data
2   Client3  ProductB  Data1
6   Client3  ProductA  Data3
8   Client3  ProductB  Data1
13  Client3  ProductB  Data2

And chunk2 would yield:

     Client   Product   Data
3   Client4  ProductA  Data2
12  Client4  ProductA  Data2

    Client   Product   Data
4  Client5  ProductB  Data1
9  Client5  ProductA  Data2
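
If one DataFrame per chunk is more convenient than a list of per-group DataFrames, a possible variation (not part of the answer above) is to number the groups with `GroupBy.ngroup()` and split on that number. The sketch below rebuilds the sample dataframe from this answer and groups by all three columns, as in the question; the names `cols`, `group_ids`, and `cut` are assumptions made for illustration.

```python
import pandas as pd

# The sample dataframe shown above, rebuilt from its three columns.
client_no = [1, 2, 3, 4, 5, 2, 3, 2, 3, 5, 1, 1, 4, 3, 2]
product_id = list('AABABAABBAABABB')
data_no = [1, 3, 1, 2, 1, 1, 3, 1, 1, 2, 1, 1, 2, 2, 3]

df = pd.DataFrame({
    'Client':  [f'Client{n}' for n in client_no],
    'Product': [f'Product{p}' for p in product_id],
    'Data':    [f'Data{n}' for n in data_no],
})

cols = ['Client', 'Product', 'Data']
grouped = df.groupby(cols)

# ngroup() labels every row with the number of its group (0 .. ngroups - 1),
# following the sorted order of the group keys.
group_ids = grouped.ngroup()

# Groups numbered below the cut go into the first chunk (~80% of the groups).
cut = int(grouped.ngroups * 0.8)

df1 = df[group_ids < cut]
df2 = df[group_ids >= cut]
```

With the default `sort=True`, the split follows the sorted key order, so the last groups in that order land in `df2`. Shuffling the group numbers first would be one way to get a random split instead.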
rahlf23