Remove whitespace from PDF Document

Question

I am using Camelot-py to read and extract attributes from several PDFs. I use `table_areas` to extract some of the attributes, but I am having difficulty setting the correct areas because the X or Y coordinates deviate between some of the forms. Some forms (Sample 1) have minimal whitespace at the top, while others (Sample 2) have more. This shifts the y-coordinates by about 10-15.

Sample 1

Sample 2

Is there a way to crop them or make them uniform at runtime?

I have found no direct way to do this. What I do is either use `tabula.py`, or convert the PDF to text, extract the text, and store it in Excel laid out the way the table should look. — Rahul Agarwal, Jan 28 '19 at 13:39
I do not have the liberty to completely change the approach at this time. I need a way to keep using the existing script by cropping off the whitespace at the top. — A.A. F, Jan 28 '19 at 13:52
In my research, I have not found any solution to this. I will follow this question to see if there is an answer to this problem. — Rahul Agarwal, Jan 28 '19 at 13:53

score 0 · Answer 1 · answered Jan 29 '19 at 09:20


I think the solution is to use the `table_regions` parameter, as described in "Find PDF Dimensions with Camelot".

Read more about table_regions in: https://camelot-py.readthedocs.io/en/master/user/advanced.html#specify-table-regions
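A minimal sketch of how this might look, assuming the table sits in roughly the same place on every form: pad a known region vertically by the observed shift so that it covers the table in both layouts, then let Camelot locate the table inside it. The helpers `pad_region` and `read_tables`, the default padding, and the example coordinates are hypothetical; `table_regions` takes strings of the form `x1,y1,x2,y2` (top-left, bottom-right) in PDF coordinate space.

```python
def pad_region(region: str, pad_y: float) -> str:
    """Grow an "x1,y1,x2,y2" region vertically by pad_y units on each side.

    Camelot regions use PDF coordinates (origin at the bottom-left), so the
    top edge y1 is the larger y value and the bottom edge y2 the smaller one.
    """
    x1, y1, x2, y2 = (float(v) for v in region.split(","))
    # Clamp the bottom edge so it never drops below the page origin.
    return f"{x1:g},{y1 + pad_y:g},{x2:g},{max(y2 - pad_y, 0):g}"


def read_tables(pdf_path: str, region: str, pad_y: float = 15):
    # Imported here so pad_region can be used without Camelot installed.
    import camelot

    return camelot.read_pdf(
        pdf_path,
        pages="1",
        table_regions=[pad_region(region, pad_y)],
    )


if __name__ == "__main__":
    # Hypothetical region measured on Sample 1.
    print(pad_region("50,700,550,500", 15))  # 50,715,550,485
```

Unlike `table_areas`, which expects exact table boundaries, a region only tells Camelot where to look, so a padded region can tolerate a shift of this size.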


Stefano Fiorucci - anakin87


score 0 · Answer 2 · answered Dec 08 '19 at 11:15

For this functionality, you could use pdfCropMargins, which crops the margins of PDF files. It is implemented as a command-line application; to call it from Python:

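import subprocess

filename = "test.pdf"

# Pass the arguments as a list so file names containing spaces are not split.
# -v: verbose output, -s: same size for all pages, -u: same crop on every page
cmd = ["pdf-crop-margins", "-v", "-s", "-u", filename]

# check=True raises CalledProcessError if the command exits with an error.
subprocess.run(cmd, check=True)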

From the documentation:

That command prints verbose output, forces all pages to be the same size (-s) and then crops each page the same amount (-u) for a uniform appearance, retaining the default of 10% of the margins.
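One way this could fit into an existing Camelot script is to crop each PDF to a separate file first and then read the cropped copy. A rough sketch, assuming `pdf-crop-margins` is on the PATH and that its `-o` option names the output file (worth confirming with `pdf-crop-margins -h`); `build_crop_command`, `crop_pdf`, and the `_cropped` naming are hypothetical:

```python
import subprocess
from pathlib import Path


def build_crop_command(src: str, dst: str) -> list[str]:
    # -s: make all pages the same size, -u: crop every page by the same amount,
    # -o: write the result to dst (assumed flag, see the note above).
    return ["pdf-crop-margins", "-s", "-u", "-o", dst, src]


def crop_pdf(src: str) -> str:
    """Crop src into a sibling file named "<stem>_cropped.pdf" and return its path."""
    source = Path(src)
    dst = str(source.with_name(source.stem + "_cropped.pdf"))
    # check=True raises CalledProcessError if the cropping command fails.
    subprocess.run(build_crop_command(src, dst), check=True)
    return dst
```

The path returned by `crop_pdf("test.pdf")` can then be passed to `camelot.read_pdf`. Because cropping changes the page dimensions, the existing `table_areas` values would need to be re-measured against a cropped file.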
