1

I currently have a dataframe built from a PDF file that was converted to CSV format. The PDF consists of 4 pages, and all of them end up in a single dataframe.

My goal is to split the dataframe by page_num.

For Example:

page_num  word_num    left    top  width  text
1          1           322     14   14     My
1          2           304     4    41     Name
1          3           322     5    9      is
1          4           316     14   20     Raghav
2          1           420     129  34     Problem 
2          2           420     31   27     just
2          3           420     159  27     got
2          4           431     2    38     complicated
3          1           322     14   14     #40
3          2           304     4    41     @gmail.com   
3          1           420     129  34     2019 
3          2           420     31   27     January

Using the pandas library, I want to split my dataframe (df) into 3 dataframes (df1, df2, df3), one per page_num value.

Thanks!

2 Answers

1

You can use groupby with operator.itemgetter:

from operator import itemgetter
# groupby yields (key, sub-frame) pairs; itemgetter(1) keeps only the sub-frame
df1, df2, df3 = map(itemgetter(1), df.groupby('page_num'))

Note groupby has sort=True by default, so the groups are yielded in sorted key order: 1, 2, 3. The unpacking above only works when page_num has exactly three distinct values.
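
To illustrate the ordering note above, here is a minimal self-contained sketch. The sample frame is an assumption: it reuses a few rows from the question, deliberately shuffled and with fewer columns, to show that the groups still come out in ascending page_num order.

```python
from operator import itemgetter

import pandas as pd

# Hypothetical sample: a few rows from the question, deliberately out of page order.
df = pd.DataFrame({
    "page_num": [3, 1, 2, 1, 3, 2],
    "word_num": [1, 1, 1, 2, 2, 2],
    "text": ["#40", "My", "Problem", "Name", "@gmail.com", "just"],
})

# groupby yields (key, sub-frame) pairs; itemgetter(1) keeps the sub-frame.
# With the default sort=True the pairs arrive in ascending key order.
df1, df2, df3 = map(itemgetter(1), df.groupby("page_num"))

# Collect the keys alone to confirm the iteration order.
page_order = [page for page, _ in df.groupby("page_num")]
```

Within each group the rows keep their original relative order, so df1 still lists "My" before "Name".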

For an arbitrary number of dataframes, see "Splitting dataframe into multiple dataframes"; a list or dict is more appropriate in that case.
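
As a rough sketch of the dict approach mentioned above, assuming the same page_num column as in the question: the name pages and the reduced sample data are illustrative only, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample with the same page_num column as the question.
df = pd.DataFrame({
    "page_num": [1, 1, 2, 2, 3, 3],
    "text": ["My", "Name", "Problem", "just", "#40", "@gmail.com"],
})

# One entry per distinct page_num, however many pages there are.
# int(page) turns the NumPy integer key into a plain Python int.
pages = {int(page): group for page, group in df.groupby("page_num")}

# Look up a single page by its number.
second_page = pages[2]
```

This avoids hard-coding df1, df2, df3, so the same code works whether the PDF has 3 pages or 30.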

jpp
0

You can use loc with a boolean mask to select the rows for each page:

df1 = df.loc[df['page_num'] == 1]
df2 = df.loc[df['page_num'] == 2]
df3 = df.loc[df['page_num'] == 3]

Output:

   page_num  word_num  left  top  width    text
0         1         1   322   14     14      My
1         1         2   304    4     41    Name
2         1         3   322    5      9      is
3         1         4   316   14     20  Raghav
   page_num  word_num  left  top  width         text
4         2         1   420  129     34      Problem
5         2         2   420   31     27         just
6         2         3   420  159     27          got
7         2         4   431    2     38  complicated
    page_num  word_num  left  top  width         text
8          3         1   322   14     14          #40
9          3         2   304    4     41   @gmail.com
10         3         1   420  129     34         2019
11         3         2   420   31     27      January

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
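
A self-contained sketch of the same loc filtering, wrapped in a small helper. The function name page_frame and the reduced sample data are assumptions made for illustration. The explicit copy() is optional; it is there so that later edits to the result do not trigger a SettingWithCopyWarning.

```python
import pandas as pd

# Hypothetical sample with the same page_num column as the question.
df = pd.DataFrame({
    "page_num": [1, 1, 2, 2, 3, 3],
    "text": ["My", "Name", "Problem", "just", "#40", "@gmail.com"],
})


def page_frame(frame, page):
    """Return the rows of `frame` whose page_num equals `page` (hypothetical helper)."""
    mask = frame["page_num"] == page  # boolean Series, True for matching rows
    # copy() makes the result independent of the original frame;
    # reset_index(drop=True) renumbers the rows from 0.
    return frame.loc[mask].copy().reset_index(drop=True)


df2 = page_frame(df, 2)
```

Dropping the reset_index call keeps the original row labels, as in the output shown above.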

n8-da-gr8