1

I am using Camelot-py to read and extract attributes from several PDFs. I use `table_areas` to extract some of the attributes, and I am having difficulty setting the correct areas because the X or Y coordinates deviate between some of the forms. Some forms (Sample 1) have minimal whitespace at the top, while others (Sample 2) have more. This shifts the y-coordinates by about 10-15.

Sample 1 (image)

Sample 2 (image)

Is there a way to crop the pages, or otherwise make them uniform, at runtime?

A.A. F
  • I have found no direct way to do this. What I do is either use `tabula-py`, or convert the PDF to text, extract the text, and store it in Excel laid out the way the table should look. – Rahul Agarwal Jan 28 '19 at 13:39
  • I do not have the liberty to completely change the approach at this time. I need a way to keep using the existing script by cropping off the whitespace at the top. – A.A. F Jan 28 '19 at 13:52
  • From my research, I have not found any solution to this. I will follow this question to see if there is any answer to this problem. – Rahul Agarwal Jan 28 '19 at 13:53

2 Answers

0

I think the solution is to use the `table_regions` parameter, as described in "Find PDF Dimensions with Camelot".

Read more about table_regions in: https://camelot-py.readthedocs.io/en/master/user/advanced.html#specify-table-regions
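As a rough sketch of how `table_regions` might absorb the vertical shift: instead of an exact `table_areas` box, pass a slightly taller region and let Camelot locate the table inside it. The helper names `pad_region` and `read_tables`, the 15-unit padding, and the sample coordinates are assumptions for illustration, not values from the question.

```python
def pad_region(area, pad_y=15.0):
    """Grow an "x1,y1,x2,y2" box vertically by pad_y on both sides.

    Camelot expects (x1, y1) as the top-left and (x2, y2) as the
    bottom-right corner in PDF coordinate space, so y1 > y2.
    """
    x1, y1, x2, y2 = (float(v) for v in area.split(","))
    top = y1 + pad_y
    bottom = max(0.0, y2 - pad_y)  # do not go below the page origin
    return f"{x1:g},{top:g},{x2:g},{bottom:g}"


def read_tables(pdf_path, area):
    """Read tables from a padded region (requires camelot-py)."""
    import camelot  # imported lazily so pad_region works without it

    return camelot.read_pdf(pdf_path, table_regions=[pad_region(area)])


# Hypothetical coordinates, for illustration only.
print(pad_region("50,700,550,600"))  # prints 50,715,550,585
```

Whether a padded region is enough depends on what else sits near the table; if neighbouring text gets picked up, a smaller `pad_y` may work better.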

0

For this, you could use pdfCropMargins, which crops the margins of PDF files. It is implemented as a command-line application; to call it from Python:

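import subprocess

filename = "test.pdf"

# -v: verbose output, -s: same page size, -u: uniform crop on every page
cmd = ["pdf-crop-margins", "-v", "-s", "-u", filename]

# Passing a list avoids breaking on filenames that contain spaces
subprocess.run(cmd, check=True)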

From the documentation:

That command prints verbose output, forces all pages to be the same size (-s) and then crops each page the same amount (-u) for a uniform appearance, retaining the default of 10% of the margins.
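If the cropped file should be written to a known path so that the existing Camelot script can pick it up, something like the following might work. It is a minimal sketch: the helper names `crop_command` and `crop_pdf` and the file names are hypothetical, and it assumes the `-o` option of `pdf-crop-margins` for naming the output file.

```python
import subprocess


def crop_command(src, dst):
    """Build the pdf-crop-margins argument list.

    -s forces all pages to the same size, -u crops every page by the
    same amount, and -o (assumed here) names the output file.
    """
    return ["pdf-crop-margins", "-s", "-u", "-o", dst, src]


def crop_pdf(src, dst):
    """Run the crop; raises CalledProcessError on a non-zero exit."""
    subprocess.run(crop_command(src, dst), check=True)


# Hypothetical file names, for illustration only.
print(crop_command("form 1.pdf", "form_1_cropped.pdf"))
```

Because cropping changes the page boxes, any `table_areas` measured on the uncropped forms would likely need to be re-measured against the cropped output.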

funnydman