How to split a DataFrame in pandas in predefined percentages?

Question

I have a pandas dataframe sorted by a number of columns. Now I'd like to split the dataframe in predefined percentages, so as to extract and name a few segments.

For example, I want to take the first 20% of rows to create the first segment, then the next 30% for the second segment and leave the remaining 50% to the third segment.

How would I achieve that?

Related: https://stackoverflow.com/questions/38250710/how-to-split-data-into-3-sets-train-validation-and-test (remove the `.sample` or `random` step and it's the same solution) — EdChum, May 04 '17 at 08:20

score 27 · Accepted Answer · answered May 04 '17 at 08:11

Use numpy.split, passing the cumulative row positions at which to cut (20%, then 20% + 30% = 50%):

a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])

Sample:

import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.random((20,5)), columns=list('ABCDE'))
#print (df)

a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])
print(a)
          A         B         C         D         E
0  0.543405  0.278369  0.424518  0.844776  0.004719
1  0.121569  0.670749  0.825853  0.136707  0.575093
2  0.891322  0.209202  0.185328  0.108377  0.219697
3  0.978624  0.811683  0.171941  0.816225  0.274074

print(b)
          A         B         C         D         E
4  0.431704  0.940030  0.817649  0.336112  0.175410
5  0.372832  0.005689  0.252426  0.795663  0.015255
6  0.598843  0.603805  0.105148  0.381943  0.036476
7  0.890412  0.980921  0.059942  0.890546  0.576901
8  0.742480  0.630184  0.581842  0.020439  0.210027
9  0.544685  0.769115  0.250695  0.285896  0.852395

print(c)
           A         B         C         D         E
10  0.975006  0.884853  0.359508  0.598859  0.354796
11  0.340190  0.178081  0.237694  0.044862  0.505431
12  0.376252  0.592805  0.629942  0.142600  0.933841
13  0.946380  0.602297  0.387766  0.363188  0.204345
14  0.276765  0.246536  0.173608  0.966610  0.957013
15  0.597974  0.731301  0.340385  0.092056  0.463498
16  0.508699  0.088460  0.528035  0.992158  0.395036
17  0.335596  0.805451  0.754349  0.313066  0.634037
18  0.540405  0.296794  0.110788  0.312640  0.456979
19  0.658940  0.254258  0.641101  0.200124  0.657625
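Calling np.split directly on a DataFrame has been reported to emit a deprecation warning in some recent pandas releases. If that is a concern, the same sequential split can be sketched with positional slicing. This is a minimal sketch, not part of the answer: `split_sequential` and the segment names are hypothetical, and the cut positions are rounded rather than truncated with `int()`, so sizes may differ by one row from the np.split call for some inputs.

```python
import numpy as np
import pandas as pd


def split_sequential(df, fracs):
    """Split df into consecutive parts sized by fracs, keeping row order.

    split_sequential is a hypothetical helper, not a pandas or NumPy API.
    """
    # Cumulative fractions give the cut positions: [0.2, 0.3, 0.5] -> [0.2, 0.5]
    cuts = np.round(np.cumsum(fracs)[:-1] * len(df)).astype(int).tolist()
    bounds = [0, *cuts, len(df)]
    # Slice by position so the existing sort order is preserved
    return [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]


np.random.seed(100)
df = pd.DataFrame(np.random.random((20, 5)), columns=list("ABCDE"))

# Name the segments by zipping labels with the parts
names = ["first", "second", "third"]
segments = dict(zip(names, split_sequential(df, [0.2, 0.3, 0.5])))
print({name: len(part) for name, part in segments.items()})
# {'first': 4, 'second': 6, 'third': 10}
```

Keeping the parts in a dict keyed by name is one way to "extract and name" the segments, as asked in the question.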

Why is this question not a dupe of this: https://stackoverflow.com/questions/38250710/how-to-split-data-into-3-sets-train-validation-and-test? — EdChum, May 04 '17 at 08:17
Because that question randomizes and this solution does not. But it is similar. — jezrael, May 04 '17 at 08:17
I'd still say this is a dupe, or certainly related; the removal of the randomisation step is trivial IMO — EdChum, May 04 '17 at 08:20
This works like a charm. It is similar to the other question you are mentioning, but without the randomization part. — Dimitris P., May 04 '17 at 08:26

score 7 · Answer 2 · answered Jul 30 '21 at 11:19

# Creating a dataframe with 70% values of original dataframe
part_1 = df.sample(frac = 0.7)
# Creating dataframe with rest of the 30% values
part_2 = df.drop(part_1.index)

answered by Loich

It's worth noting that the arrays created by this solution are unpredictable (i.e. the values chosen by `sample` may differ in each run). This may be a plus or a minus, depending on the use case. – Luiz Martins Sep 29 '21 at 04:55

You can specify random_state, either int, array-like, or BitGenerator. This way you can get the same split each time. – Skulas Jan 23 '22 at 11:39
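The two comments above can be illustrated with a short sketch. It assumes a small toy DataFrame with the default RangeIndex (not the asker's data) and an arbitrary seed of 0; `sort_index` only restores the original row order when the index was monotonic to begin with.

```python
import numpy as np
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame(np.arange(40).reshape(10, 4), columns=list("ABCD"))

# A fixed random_state makes sample() pick the same rows on every run
part_1 = df.sample(frac=0.7, random_state=0)
part_2 = df.drop(part_1.index)

# sample() returns rows in shuffled order; sort_index() puts them back
# in index order (the original order for a default RangeIndex)
part_1 = part_1.sort_index()

# Sampling again with the same seed reproduces the same selection
again = df.sample(frac=0.7, random_state=0).sort_index()
print(part_1.index.equals(again.index))  # True
print(len(part_1), len(part_2))          # 7 3
```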

Gal Fridman · Answer 3 · 2019-07-23T14:17:50.167

I've written a simple function that does the job.

Maybe it will help you.

P.S.:

The sum of the fractions must be 1.

It returns len(fracs) new DataFrames, so you can pass a fractions list as long as you want (e.g. fracs=[0.1, 0.1, 0.3, 0.2, 0.3]).

import math

import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.random((99,4)))

def split_by_fractions(df: pd.DataFrame, fracs: list, random_state: int = 42):
    # compare with a tolerance: a sum of floats is not always exactly 1.0
    assert math.isclose(sum(fracs), 1.0), 'fractions sum is not 1.0 (fractions_sum={})'.format(sum(fracs))
    remain = df.index.copy().to_frame()
    res = []
    for i in range(len(fracs)):
        # rescale the fraction relative to the rows that are still unassigned
        fractions_sum = sum(fracs[i:])
        frac = fracs[i]/fractions_sum
        idxs = remain.sample(frac=frac, random_state=random_state).index
        remain = remain.drop(idxs)
        res.append(idxs)
    return [df.loc[idxs] for idxs in res]

train,test,val = split_by_fractions(df, [0.8,0.1,0.1]) # e.g: [train, test, validation]

print(train.shape, test.shape, val.shape)

outputs:

(79, 4) (10, 4) (10, 4)

Does this maintain the order of the data? And does it split sequentially, i.e. is the val table the data that appeared after the test table? Thank you — rex, Apr 05 '20 at 10:53
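On the ordering question in the comment above: as far as the code shows, `sample` draws rows at random, so the parts are neither in the original order nor sequential. The sketch below checks this on a sample-then-drop split in the style of the answer. The data and seeds are assumptions for illustration, and the default RangeIndex is assumed so that index order equals original row order.

```python
import numpy as np
import pandas as pd

# Toy data with a default RangeIndex, assumed for illustration
df = pd.DataFrame(np.random.default_rng(0).random((99, 4)))

# Random split in the style of the answer: sample, then drop what was taken
train = df.sample(frac=0.8, random_state=42)
rest = df.drop(train.index)

# sample() shuffles, so the part is not in the original row order
shuffled = not train.index.is_monotonic_increasing
print(shuffled)

# Sorting restores the original order within the part ...
train_sorted = train.sort_index()
print(train_sorted.index.is_monotonic_increasing)  # True

# ... but the rows are still a random subset, not one consecutive block
span = train_sorted.index.max() - train_sorted.index.min() + 1
contiguous = span == len(train_sorted)
print(contiguous)
```

For a split that keeps sorted data in consecutive blocks, positional slicing (or the np.split approach from the accepted answer) is the closer fit.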
