
I have a pandas data frame with a large number of rows:

Col1 Col2 Col3 Col4
A1   B3   C2   D4
A2   B4   C3   D2
A5   B2   C1   D1
A4   B3   C2   D4
A5   B4   C3   D2
A2   B3   C4   D1
A3   B2   C3   D4
A1   B4   C3   D2
A4   B2   C1   D1

I want to split this data frame into n chunks so that each chunk contains at least one row for each of the unique values in Col4.

For example: if n=3

df1:

Col1 Col2 Col3 Col4
A1   B3   C2   D4
A2   B4   C3   D2
A5   B2   C1   D1

df2:

Col1 Col2 Col3 Col4
A4   B3   C2   D4
A5   B4   C3   D2
A2   B3   C4   D1

df3:

Col1 Col2 Col3 Col4
A3   B2   C3   D4
A1   B4   C3   D2
A4   B2   C1   D1
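
For reference, the example frame can be rebuilt like this (a sketch for reproducibility; the answers below assume a `df` of this shape):

```python
import pandas as pd

# the question's example frame, rebuilt row for row
df = pd.DataFrame({
    'Col1': ['A1', 'A2', 'A5', 'A4', 'A5', 'A2', 'A3', 'A1', 'A4'],
    'Col2': ['B3', 'B4', 'B2', 'B3', 'B4', 'B3', 'B2', 'B4', 'B2'],
    'Col3': ['C2', 'C3', 'C1', 'C2', 'C3', 'C4', 'C3', 'C3', 'C1'],
    'Col4': ['D4', 'D2', 'D1', 'D4', 'D2', 'D1', 'D4', 'D2', 'D1'],
})
```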
msksantosh
  • What should happen if `n=4`? Here there are not enough unique cases in `Col4`. – Little Bobby Tables Jan 12 '18 at 14:40
  • do you care about the original order of the rows? about the size of the chunks? this looks like a very bespoke use case. – louis_guitton Jan 12 '18 at 14:44
  • @josh in that case, I would like to have the maximum number of chunks that match the criteria. In my case, I have millions of rows, so that scenario is not possible – msksantosh Jan 12 '18 at 14:45
  • @laguittemh no, I do not care about the original order of the rows, or the size of the chunks – msksantosh Jan 12 '18 at 14:46
  • What are you ultimately trying to do with this chunking? There may be a simpler way of achieving your end goal. – Silenced Temporarily Jan 12 '18 at 14:57
  • Playing devil's advocate here, what if `n=3` and there are more than `n` unique entries for a group in `Col4`? Does it matter if you have 5 in one chunk and 1 in all the others? e.g. df1.Col4 = [D4, D4, D4, D2, D1], df2.Col4 = [D4, D2, D1]. – Little Bobby Tables Jan 12 '18 at 15:02
  • @A. Leistra: I am trying to build a random forest on each of the chunks with col4 as my dependent variable, and I am trying to have as uniform a representation of col4 in each chunk as possible – msksantosh Jan 12 '18 at 15:14
  • @josh I would prefer to have as many small chunks as possible, but if no other way, your solution is also fine – msksantosh Jan 12 '18 at 15:15

2 Answers


It isn't pretty, but I believe it is a solution that covers the stipulations above.

We can split the DataFrames using array_split from numpy. See this post for more details on it.

import numpy as np
import pandas as pd

df = pd.DataFrame({'D': 'a a a b b b c c c c c c'.split(' '), 'A': range(12)})

n = 3

# here we rank each row within its 'D' category (1, 2, 3, ...)
df = df.assign(ranked=lambda df: df.groupby('D')['A'].rank(method='first'))

# we now filter into two dfs: one with the 'D' categories evenly represented
# and one with all the leftovers where ranked > n
df1 = df.loc[lambda x: x.ranked <= n, :].sort_values('ranked')
df2 = df.loc[lambda x: x.ranked > n, :].sort_values('ranked')

# split them as mentioned before
df1_list = np.array_split(df1, n)
df2_list = np.array_split(df2, n)

# zip the lists of chunks back together and join the df1 and df2 pieces
list_of_chunked_dfs = [pd.concat(dfs).drop('ranked', axis=1) for dfs in zip(df1_list, df2_list)]
Little Bobby Tables

Just need groupby + //. Note that this takes consecutive blocks of n rows, which satisfies the requirement here because the example rows already cycle through D4, D2, D1.

import numpy as np
import pandas as pd

# df is the data frame from the question
d = {}
n = 3

# group rows by integer position // n, i.e. consecutive blocks of n rows
for x, y in df.groupby(np.arange(len(df)) // n):
    d[x] = y

d
Out[625]: 
{0:   Col1 Col2 Col3 Col4
 0   A1   B3   C2   D4
 1   A2   B4   C3   D2
 2   A5   B2   C1   D1, 1:   Col1 Col2 Col3 Col4
 3   A4   B3   C2   D4
 4   A5   B4   C3   D2
 5   A2   B3   C4   D1, 2:   Col1 Col2 Col3 Col4
 6   A3   B2   C3   D4
 7   A1   B4   C3   D2
 8   A4   B2   C1   D1}

For example

d[0]
Out[626]: 
  Col1 Col2 Col3 Col4
0   A1   B3   C2   D4
1   A2   B4   C3   D2
2   A5   B2   C1   D1
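
The positional split above gives one row per Col4 value only because the example rows already cycle through D4, D2, D1. If the rows arrive in arbitrary order, one possible pre-step (a sketch, not part of the original answer; the shuffled frame is hypothetical) is a stable sort on a within-group counter, so that consecutive rows interleave the categories before chunking:

```python
import numpy as np
import pandas as pd

# a shuffled version of the question's frame (hypothetical ordering)
df = pd.DataFrame({
    'Col1': ['A1', 'A4', 'A2', 'A5', 'A3', 'A5', 'A2', 'A1', 'A4'],
    'Col4': ['D4', 'D1', 'D2', 'D1', 'D4', 'D2', 'D4', 'D2', 'D1'],
})

k = df['Col4'].nunique()  # rows per chunk: one per unique Col4 value

# cumcount numbers each row within its Col4 group (0, 1, 2, ...);
# a stable sort on it interleaves the groups: D4, D1, D2, D4, D1, D2, ...
interleaved = df.iloc[df.groupby('Col4').cumcount().argsort(kind='stable')]

# now the positional chunking covers every Col4 value per chunk
d = {x: y for x, y in interleaved.groupby(np.arange(len(interleaved)) // k)}
```

If the group sizes are unequal, the trailing chunks may miss some categories, which matches the behavior the asker accepted in the comments.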
BENY