4

I need to extract tables from PDFs. The tables can be of any type: multiple headers, vertical headers, horizontal headers, etc.

I have implemented the basic use cases for both Tabula and Camelot and found that Tabula does a bit better than Camelot, but it still cannot detect all tables perfectly, and I am not sure whether it will work for all kinds of tables.

So I am seeking suggestions from experts who have implemented a similar use case.

Example PDFs: PDF1 PDF2 PDF3

Tabula Implementation:

import tabula
tab = tabula.read_pdf('pdfs/PDF1.pdf', pages='all')
for t in tab:
    print(t, "\n=========================\n")

Camelot Implementation:

import camelot
tables = camelot.read_pdf('pdfs/PDF1.pdf', pages='all', split_text=True)
print(tables)  # TableList summary: how many tables were detected
for tabs in tables:
    print(tabs.df, "\n=================================\n")
Niranjan Kumar
  • *"still not able to detect all tables perfectly"* - it is extremely unlikely that there will ever be software *detecting all tables perfectly*. – mkl Apr 23 '20 at 13:52
  • @Niranjan Kumar: Try SLICEmyPDF in 1 of the answers at https://stackoverflow.com/questions/56017702/how-to-extract-table-from-pdf-in-python/72414309#72414309 – 123456 Jul 02 '22 at 07:48

2 Answers

9

Please read this: https://camelot-py.readthedocs.io/en/master/#why-camelot

The main advantage of Camelot is that this library is rich in parameters, through which you can improve the extraction.

Obviously, the application of these parameters requires some study and various attempts.

Here you can find a comparison of Camelot with other PDF table extraction libraries.
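To make the point about parameters concrete, here is a minimal sketch of trying several Camelot settings on the same file and keeping the run that Camelot itself scores highest. `flavor`, `line_scale`, `row_tol` and `edge_tol` are options of `camelot.read_pdf`, and `parsing_report` is the per-table report it exposes; the specific values, the helper names and the "highest mean accuracy wins" rule are assumptions for illustration only. The accuracy figure is Camelot's own estimate, not a guarantee that the table is correct, so the result still needs a visual check.

```python
from statistics import mean

# Keyword sets to try in order. The option names belong to camelot.read_pdf;
# the values are illustrative guesses, not recommendations.
CANDIDATE_SETTINGS = [
    {"flavor": "lattice"},                                 # ruled tables, defaults
    {"flavor": "lattice", "line_scale": 40},               # detect shorter ruling lines
    {"flavor": "stream"},                                  # whitespace-separated tables
    {"flavor": "stream", "row_tol": 10, "edge_tol": 500},  # looser row/edge grouping
]


def mean_accuracy(reports):
    """Average the 'accuracy' field of parsing reports; 0.0 if there are none."""
    return mean(r["accuracy"] for r in reports) if reports else 0.0


def pick_best(runs):
    """Return the run whose reports (second element) have the highest mean accuracy."""
    return max(runs, key=lambda run: mean_accuracy(run[1]))


def extract_with_best_settings(path, pages="all"):
    """Try every candidate and return (settings, tables) for the best-scoring run."""
    import camelot  # imported lazily so the helpers above work without it

    runs = []
    for settings in CANDIDATE_SETTINGS:
        tables = camelot.read_pdf(path, pages=pages, **settings)
        reports = [t.parsing_report for t in tables]
        runs.append((settings, reports, tables))
    best = pick_best(runs)
    return best[0], best[2]
```

For example, `settings, tables = extract_with_best_settings('pdfs/PDF1.pdf')` would return the winning keyword set together with its tables, which makes it easier to see which flavor suits a given document before tuning further.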

2

I think Camelot extracts data in a cleaner format that is not jumbled up (i.e. the data retains its information and row contents are not affected). So the quality of the extracted data is better when cells differ in their number of lines. Also note that Tabula requires a Java Runtime Environment.
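Since the Java requirement can be a surprise at deployment time, a small guard like the one below may help. It is only a sketch: `java_available` and `read_tables` are hypothetical helper names, and falling back to Camelot when Java is missing is an assumed design choice, not something either library does for you.

```python
import shutil


def java_available():
    """Return True if a `java` executable is on PATH, which tabula-py relies on."""
    return shutil.which("java") is not None


def read_tables(path, pages="all"):
    """Use Tabula when Java is present, otherwise fall back to Camelot."""
    if java_available():
        import tabula
        return tabula.read_pdf(path, pages=pages)  # list of DataFrames
    import camelot
    # Camelot returns Table objects; .df gives the underlying DataFrame
    return [t.df for t in camelot.read_pdf(path, pages=pages)]
```

Either branch returns a list of pandas DataFrames, so the calling code does not need to know which library produced them.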

There are both open (Tabula, pdf-table-extract) and closed-source (smallpdf, PDFTables) tools that are widely used to extract tables from PDF files. They either give a nice output or fail miserably. There is no in between. This is not helpful since everything in the real world, including PDF table extraction, is fuzzy. This leads to the creation of ad-hoc table extraction scripts for each type of PDF table. Camelot was created to offer users complete control over table extraction. If you can’t get your desired output with the default settings, you can tweak them and get the job done!

Yash Darji