Extracting String Data from PDF Multi-Page Columns with Python

Asked Sep 02 '18 at 05:04

I have some PDFs laid out in columns that I need to scrape. The problem is that each column runs across multiple pages instead of following the typical column layout: column 1 continues on the next page before column 2 picks up. For example:

******Column 1******************Column 2*************

Sombody once told me            Finger and her thumb
The world was gonna             In the shape of an "L"
Roll me. I ain't the            On her forehead. Well
*******************NEXT PAGE**************************
Sharpest tool in the            The years start coming
Shed. She was looking           And they don't stop coming
Kind of dumb with her

I have tried standard PDF scrapers like PDFMiner, but they just return a string that reads like:

Sombody once told me
The world was gonna
Roll me. I ain't the
Finger and her thumb

Any help would be appreciated!

Tylerr

You can try tabula-py (https://github.com/chezou/tabula-py) for table extraction. This discussion may also be helpful: https://stackoverflow.com/questions/47533875/how-to-extract-table-as-text-from-the-pdf-using-python/47719296 – Vladimir Poghosyan Sep 02 '18 at 12:10
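
One possible approach, sketched here rather than taken from the thread, is to keep each text line's position and regroup the lines yourself: send every line to the left or right column by its x coordinate, then join column 1 from all pages before column 2 from all pages. The names read_columns_across_pages, boxes_from_pdf and the split_x boundary are hypothetical, and the sketch assumes exactly two columns separated at a known x position. The optional boxes_from_pdf helper assumes pdfminer.six is installed and uses its extract_pages layout objects; the reordering itself is plain Python.

```python
from typing import List, Tuple

# One text line: (x0, y_top, text). Coordinates follow the PDF convention
# (origin at the bottom-left), so a larger y_top means higher on the page.
Box = Tuple[float, float, str]


def read_columns_across_pages(pages: List[List[Box]], split_x: float) -> str:
    """Return column 1 of every page, followed by column 2 of every page.

    split_x is an assumed x coordinate separating the two columns.
    """
    left: List[str] = []
    right: List[str] = []
    for boxes in pages:
        # Order the lines top-to-bottom within the page, then assign
        # each one to a column by where it starts horizontally.
        for x0, _y_top, text in sorted(boxes, key=lambda b: -b[1]):
            (left if x0 < split_x else right).append(text.strip())
    return "\n".join(left + right)


def boxes_from_pdf(path: str) -> List[List[Box]]:
    """Collect positioned text lines per page (assumes pdfminer.six)."""
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    pages: List[List[Box]] = []
    for layout in extract_pages(path):
        boxes: List[Box] = []
        for element in layout:
            if isinstance(element, LTTextContainer):
                for line in element:
                    if isinstance(line, LTTextLine):
                        boxes.append((line.x0, line.y1, line.get_text()))
        pages.append(boxes)
    return pages


# Hand-made coordinates standing in for the two pages in the question.
page1 = [
    (50, 700, "Sombody once told me"), (300, 700, "Finger and her thumb"),
    (50, 680, "The world was gonna"), (300, 680, 'In the shape of an "L"'),
    (50, 660, "Roll me. I ain't the"), (300, 660, "On her forehead. Well"),
]
page2 = [
    (50, 700, "Sharpest tool in the"), (300, 700, "The years start coming"),
    (50, 680, "Shed. She was looking"), (300, 680, "And they don't stop coming"),
    (50, 660, "Kind of dumb with her"),
]

print(read_columns_across_pages([page1, page2], split_x=200))
```

For a real file, something like read_columns_across_pages(boxes_from_pdf("file.pdf"), split_x=...) would be the starting point, with split_x chosen by inspecting the x0 values the PDF actually produces. Headers such as the "Column 1" row would still need to be filtered out separately.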

0 Answers