
I'm trying to read a PDF document into Python (I removed some content because it contained sensitive data: https://ufile.io/bgghw). I have to work with the check boxes and perform actions based on these and other text.

I tried PyPDF3, but it only gave corrupted output. After a little research I found pdfminer, which sounds promising, with the downside that it requires Python 2.7.

I'm not sure whether there are other packages, or whether there is a best practice for working with PDFs in Python, as all the information I found is several years old and much of it is contradictory. Of course, I could settle for the best package for my case :)

Thanks for any advice!

Sebastian

1 Answer


First option: pypdf

First, run this in cmd to install pypdf (it may work better than PyPDF3, which you already tried):

pip install pypdf

Then, to extract text from a PDF file, use the following code:

# importing required modules
import pypdf

# creating a pdf reader object
reader = pypdf.PdfReader("example.pdf")

# printing number of pages in pdf file
print(len(reader.pages))

# creating a page object
page = reader.pages[0]

# extracting text from page
print(page.extract_text())
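
Since the question is about check boxes: if the boxes in your PDF are real form fields (an AcroForm), pypdf can read their state directly via `reader.get_fields()`. The following is only a rough sketch under that assumption. The helper names `is_checked` and `checkbox_states` are made up for illustration, and the "on" value of a box (often `/Yes`) is chosen by whoever created the PDF, so this treats anything other than `/Off` as checked. Note that `/Btn` also covers radio buttons and push buttons, so you may need to filter further.

```python
def is_checked(value):
    """Return True if a checkbox field value looks like an 'on' state.

    Assumption: an unchecked box has no value or the value "/Off";
    any other name (often "/Yes") is treated as checked.
    """
    return value is not None and str(value) not in ("/Off", "Off", "")


def checkbox_states(path):
    """Map each button field (/FT == /Btn) in the PDF to True/False."""
    # Imported here so is_checked() can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # get_fields() returns None when the document has no form fields.
    fields = reader.get_fields() or {}
    return {
        name: is_checked(field.get("/V"))
        for name, field in fields.items()
        if field.get("/FT") == "/Btn"
    }
```

You would call it as `checkbox_states("example.pdf")` and then branch on the resulting dictionary. If it comes back empty, the check boxes are probably just drawn graphics rather than form fields, and this approach won't help.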

Second option: textract

Run this in cmd to install textract:

pip install textract

Then, to read a PDF, use the following code:

import textract

# process() returns bytes, so decode to get a regular string
text = textract.process('path/to/pdf/file', method='pdfminer').decode('utf-8')

Good luck!

Martin Thoma