
Say I have many similar PDF files like the one from here:

I would like to extract the following table and save it as an Excel file:


I'm able to extract the table and save it as an Excel file manually with the Excalibur package.

After installing Excalibur with pip3, I initialize the metadata database using:

$ excalibur initdb

And then start the webserver using:

$ excalibur webserver

Then go to http://localhost:5000 and start extracting tabular data from PDFs.

I wonder if it's possible to do that automatically with a Python script for multiple PDF files, using packages such as excalibur-py, camelot, pdfminer, etc., since the size and position of the table are fixed for the same city's reports.

You may download other report files from this link.

Many thanks in advance.


1 Answer


Using Camelot, you can build a pipeline like this:

import os

import camelot

# PDF paths and approximate table regions as "x1,y1,x2,y2" strings
files_list = ['FIRST_PATH', 'SECOND_PATH', ...]
regions = ['REGION_COORDINATES_1', 'REGION_COORDINATES_2', ...]

for filepath in files_list:
    tables = camelot.read_pdf(filepath, pages='1-end', table_regions=regions)
    # Export each PDF's tables to its own Excel file so results are not overwritten
    name = os.path.splitext(os.path.basename(filepath))[0]
    tables.export(f'{name}.xlsx', f='excel')

The `table_regions` parameter should be used when you know the approximate position of the table on the page; if you know the exact position of the table, use `table_areas` instead.
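For instance, here is a minimal sketch using `table_areas`; the file name and coordinate string are placeholders, and the coordinates are "x1,y1,x2,y2" strings in PDF coordinate space (left-top to right-bottom):

import camelot

# Placeholder path and coordinates; replace with values from your own reports
tables = camelot.read_pdf('report.pdf', pages='1', flavor='stream',
                          table_areas=['316,499,566,337'])
tables[0].to_excel('report.xlsx')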

You can read more about these parameters and other topics in the Camelot documentation.

  • Thanks, may I ask how could I find the table regions from pdf files? – ah bon Apr 13 '21 at 13:33
  • You can use visual debugging (https://camelot-py.readthedocs.io/en/master/user/advanced.html#visual-debugging). Otherwise, if you have already extracted tables, you can get the coordinates from `table._bbox`; see the sketch below. – Stefano Fiorucci - anakin87 Apr 13 '21 at 13:36
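A minimal sketch of both approaches from the comment above (the PDF path is a placeholder, and plotting assumes matplotlib is installed):

import camelot

tables = camelot.read_pdf('report.pdf', pages='1', flavor='stream')  # placeholder path

# Visual debugging: plot the text on the page to read off table coordinates
camelot.plot(tables[0], kind='text').show()

# Bounding box of an already extracted table, (x1, y1, x2, y2) in PDF coordinates
print(tables[0]._bbox)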