Extracting tables spanning multiple pages

Question

I am trying to extract tables from a PDF, and Tabula has worked for that.

The issue I am facing is that when a table spans multiple pages, Tabula treats the content on each new page as a separate table.

Is there any way or logic to overcome this?

Code:

from tabula import read_pdf
df = read_pdf("SampleTableFormat2pages.pdf", multiple_tables=True, pages="all")
print(len(df))
print(df)

Output:

2
[        0       1       2       3       4
0  Label1  Label2  Label3  Label4  Label5
1   Row11   Row12   Row13   Row14   Row15
2   Row21   Row22   Row23   Row24   Row25
3   Row31   Row32   Row33   Row34   Row35,        0      1      2      3      4
0  Row41  Row42  Row43  Row44  Row45
1  Row51  Row52  Row53  Row54  Row55]

Is there any logic that lets Tabula recognize the table boundary and its continuation on the next page?

Or is there any other library that can help with this?

I do believe it can be done, because if you do it using windows software you can read tables spanning multiple pages. I cannot provide help further than that, but there must be code for it! — d_kennetz, Sep 08 '18 at 11:18
There is no inbuilt solution from the library, but I believe this can be solved with Pandas concat ?! — ExtractTable.com, Oct 28 '19 at 17:26

score 7 · Answer 1 · answered Sep 26 '18 at 01:52

I suggest reading one page at a time and concatenating the results into a final table. You can use this function to get the number of pages in the PDF:

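import re

def count_pdf_pages(file_path):
    # Heuristic: count "/Type /Page" markers (but not "/Pages") in the raw bytes.
    # It may miscount PDFs whose page objects are stored in compressed streams.
    rxcountpages = re.compile(rb"/Type\s*/Page([^s]|$)", re.MULTILINE | re.DOTALL)
    with open(file_path, "rb") as temp_file:
        return len(rxcountpages.findall(temp_file.read()))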

Now loop over the pages and read the table from each one:

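import pandas as pd
import tabula

pdf_path = "SampleTableFormat2pages.pdf"
pages = count_pdf_pages(pdf_path)
frames = []
for pageiter in range(pages):
    df = tabula.read_pdf(pdf_path, pages=pageiter + 1, guess=False)
    # If you want to change the table by editing the columns, you can do that here.
    # Recent tabula-py versions return a list of DataFrames; older ones return a single DataFrame.
    frames.extend(df if isinstance(df, list) else [df])
df_combine = pd.concat(frames, ignore_index=True)  # keeps page order; use merge instead if that fits your need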
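One thing to watch when concatenating is the header row: in the question's output only the first page's fragment carries the labels, and both fragments have plain integer column names. A minimal sketch of stitching such fragments with pandas alone might look like this. The helper name `stitch_pages` is hypothetical, and it assumes every fragment has the same number of columns and that only the first one starts with the header row.

```python
import pandas as pd

def stitch_pages(tables):
    """Combine headerless per-page fragments of one table into a single DataFrame.

    Assumption (not from Tabula itself): all fragments share the same column
    count, and the first row of the first fragment holds the column labels.
    """
    combined = pd.concat(tables, ignore_index=True)
    header = combined.iloc[0]                        # first row holds the labels
    body = combined.iloc[1:].reset_index(drop=True)  # remaining rows are data
    body.columns = list(header)
    return body

# Fragments shaped like the two tables printed in the question.
page1 = pd.DataFrame([
    ["Label1", "Label2", "Label3", "Label4", "Label5"],
    ["Row11", "Row12", "Row13", "Row14", "Row15"],
    ["Row21", "Row22", "Row23", "Row24", "Row25"],
    ["Row31", "Row32", "Row33", "Row34", "Row35"],
])
page2 = pd.DataFrame([
    ["Row41", "Row42", "Row43", "Row44", "Row45"],
    ["Row51", "Row52", "Row53", "Row54", "Row55"],
])

result = stitch_pages([page1, page2])
print(result)
```

If the fragments were read with header inference turned on instead, the first data row of each later page would end up as that fragment's column names, so this sketch would need adjusting for that case.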

answered Sep 26 '18 at 01:52

VoldyArrow

Can you please accept the answer if this satisfies your requirements? – VoldyArrow Mar 05 '20 at 15:09
How do you do this for password-protected PDF files? – Kaviyarasu Arasu May 29 '21 at 07:47
You can open your password protected pdf by following the steps in this thread and then use the code above. https://stackoverflow.com/questions/26130032/open-a-protected-pdf-file-in-python @KaviyarasuArasu – VoldyArrow Dec 03 '21 at 14:03
