What Python library should we use to extract tables with complex headers from a PDF?

Question

I tried many libraries to extract tables from a PDF, such as camelot, tabula, PDFPlumber, and PDFTabExtract, but they don't give good results. The main problem is that the headers are in a complex format, and I have several different header formats.

With camelot, I can't write a script that works for all pages in my PDF. With Tabula, I get a confusing dataframe when the table has a rotated text header. With PDFPlumber, I have problems with stream tables (it works well only for lattice tables), and with PDFTabExtract, rotated text is simply ignored.

Is there any solution with which I can convert any table in my PDF, given that the tables have different formats? I know that I can't find a generic solution, but at least something that gives a decent result.

Should I work with OCR? What would you recommend?

I really appreciate any help. Thank you in advance.

score 0 · Answer 1 · answered Jun 10 '19 at 15:15

The PDF format has no dedicated construct for describing tables. Tables are built by positioning chunks of text at particular distances from one another. Extracting tables from a PDF is therefore based on identifying a table-like structure by analyzing those distances.
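To make the distance-based idea concrete, here is a minimal sketch of how an extractor might turn positioned words into rows and cells. It is not the algorithm of any particular library; the `Word` type, the function names, and the `y_tol` and `gap` thresholds are all assumptions chosen for illustration.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    """A word with its position on the page (hypothetical structure)."""
    text: str
    x0: float   # left edge
    x1: float   # right edge
    top: float  # distance from the top of the page


def group_rows(words: List[Word], y_tol: float = 3.0) -> List[List[Word]]:
    """Group words whose vertical position is within y_tol of the row's first word."""
    rows: List[List[Word]] = []
    for w in sorted(words, key=lambda w: (w.top, w.x0)):
        if rows and abs(rows[-1][0].top - w.top) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    return rows


def split_cells(row: List[Word], gap: float = 10.0) -> List[str]:
    """Merge neighbouring words into one cell unless the horizontal gap exceeds `gap`."""
    cells: List[str] = []
    prev = None
    for w in sorted(row, key=lambda w: w.x0):
        if prev is not None and w.x0 - prev.x1 <= gap:
            cells[-1] += " " + w.text
        else:
            cells.append(w.text)
        prev = w
    return cells


def words_to_table(words: List[Word], y_tol: float = 3.0, gap: float = 10.0) -> List[List[str]]:
    """Guess a table purely from the distances between words."""
    return [split_cells(row, gap) for row in group_rows(words, y_tol)]


sample = [
    Word("Unit", 50, 70, 100), Word("price", 73, 98, 100), Word("Qty", 150, 170, 101),
    Word("4.50", 50, 72, 120), Word("3", 150, 156, 120),
]
table = words_to_table(sample)
print(table)
```

With these made-up coordinates, "Unit" and "price" end up in the same cell because they are close together, while "Qty" becomes a separate column. Both thresholds are guesses, and a rotated or multi-line header places its words at positions that break the row grouping, which is one reason complex headers tend to come out scrambled.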

Since the detection is not deterministic (unlike, for example, a table in a docx file, where the structure is stored explicitly), each solution that you mentioned has its own heuristics for detecting tables and text, each with its pros and cons. A complex table such as the one you gave as an example is bound to yield poor results from most or all PDF text extractors.

OCR will likely identify the table in a similar manner and give similar results.
