0

I have a dataframe (df) with about 300 rows. The column names are 'Description', 'Impact' and 'lower_desc':

    Description                                         Impact  lower_desc
0   BICC's mission in its current phase extends th...   BAD     [bicc's, mission, current, phase, extends, pre...
1   Narrative Impact Report\r\n\r\nDuring the cour...   GOOD    [narrative, impact, report, course, project, (...
2   Our findings have been used by social psycholo...   BAD     [findings, used, social, psychologists, intere...
3   The data set has been used for secondary analy...   BAD     [data, set, used, secondary, analysis, byt, es...
4   So far it seems that our research outcome has ...   BAD     [far, seems, outcome, 'used', people, (educati...
5   Our findings on the effects of urbanisation on...   BAD     [findings, effects, urbanisation, cognition, r...
6   The research findings have been used by a rang...   GOOD    [findings, used, range, societal, bodies,, inc...
7   In the last year we have disseminated the rese...   BAD     [last, year, disseminated, five, different, wo...
8   \r\nThis research has been concerned with how ...   BAD     [concerned, people, withhold, actions,, brain,...
9   The Centre has run a varied programme of cours...   BAD     [centre, run, varied, programme, courses,, mas...
10  We presented evidence at one of the seminars o...   BAD     [presented, evidence, one, seminars, additiona.
...

I am producing a training set and a test set, so I want to split the dataframe in two, i.e. the first 200 rows go into df1 and the remaining 100 go into df2. There may be more or fewer than 300 rows.

How would one go about this?

Dillon
Nicholas

2 Answers

4

This puts the first 200 rows into df1 and everything from row 200 onwards into df2:

df1 = df.iloc[:200]
df2 = df.iloc[200:]

If you want to stop at row 300 do this instead:

df2 = df.iloc[200:300]

You might want to reset the index on df2 so that it does not start at 200:

df2 = df.iloc[200:300].reset_index(drop=True)
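
Since the question mentions that the frame may hold more or fewer than 300 rows, the slicing above can be wrapped in a small helper. This is only a minimal sketch: the function name `split_head_tail`, the `n_train` parameter and the toy frame are made up for illustration and are not part of the original data.

```python
import pandas as pd

def split_head_tail(df, n_train=200):
    """Return (first n_train rows, all remaining rows), each re-indexed from 0."""
    train = df.iloc[:n_train].reset_index(drop=True)
    test = df.iloc[n_train:].reset_index(drop=True)
    return train, test

# Hypothetical stand-in for the 300-row frame in the question
toy = pd.DataFrame({
    "Description": [f"text {i}" for i in range(300)],
    "Impact": ["GOOD" if i % 3 == 0 else "BAD" for i in range(300)],
})

df1, df2 = split_head_tail(toy)
print(len(df1), len(df2))
```

Because `iloc` slicing does not raise when the stop position is past the end, a frame with fewer than 200 rows simply ends up entirely in the first part, leaving the second part empty.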
Dillon
1
import pandas as pd

src = "/path/to/your/data/data.csv"
df = pd.read_csv(src, sep="\t")

# Floor division keeps half_len an integer, which iloc requires
half_len = len(df) // 2

# Retrieve the first half of the dataframe
df_one = df.iloc[:half_len]

#       Description                                       Impact    lower_desc
# 0   BICC's mission in its current phase extend...
# 1   Narrative Impact Report\r\n\r\nDuring the ...
# 2   Our findings have been used by social psyc...
# 3   The data set has been used for secondary a...
# 4   So far it seems that our research outcome ...

# Retrieve the other part of the dataframe
df_two = df.iloc[half_len:]

#        Description                                       Impact    lower_desc
# 5   Our findings on the effects of urbanisatio...
# 6   The research findings have been used by a ...
# 7   In the last year we have disseminated the ...
# 8   \r\nThis research has been concerned with ...
# 9   The Centre has run a varied programme of c...
# 10  We presented evidence at one of the semina...
alvaro nortes
  • Ahh, thank you. This is perfect, as I was wondering how to do halves (or thirds etc.) – Nicholas Jun 26 '18 at 10:44
  • @ScoutEU If you want to split it into n equal parts you can use `np.array_split(df, n)`. See [this page](https://stackoverflow.com/questions/17315737/split-a-large-pandas-dataframe) – Dillon Jun 26 '18 at 10:49
  • @Alvaro-nortes You should note that this will only work where `half_len` (or `third_len` etc.) is an integer and will fail otherwise. Even `df.iloc[:150.0]` will fail, since `150.0` is a float and pandas expects an int – Dillon Jun 26 '18 at 10:52
  • Thank you Dillon! – Nicholas Jun 26 '18 at 11:02
  • Thanks @dillion, I have modified the answer – alvaro nortes Nov 07 '21 at 18:39
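
The comments above touch on splitting into thirds or any n parts. As a hedged sketch of that idea, the snippet below applies `np.array_split` to the row positions rather than to the dataframe itself, then selects each chunk with `iloc`. That is a variation on the call suggested in the comment, not the same call. The helper name `split_into_parts` and the toy frame are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def split_into_parts(df, n):
    """Split df into n consecutive chunks whose lengths differ by at most one row."""
    # np.array_split accepts lengths that are not divisible by n,
    # so no manual rounding of chunk sizes is needed
    position_chunks = np.array_split(np.arange(len(df)), n)
    return [df.iloc[chunk] for chunk in position_chunks]

# Hypothetical 10-row frame, deliberately not divisible by 3
toy = pd.DataFrame({"value": range(10)})

parts = split_into_parts(toy, 3)
print([len(p) for p in parts])  # [4, 3, 3]
```

Each chunk keeps its original index labels; chain `.reset_index(drop=True)` on a chunk if it should start again from 0.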