Best practice to read pdf into python

Question

I'm trying to read a PDF document into Python (I removed some content because it contains sensitive data: https://ufile.io/bgghw). I have to work with the check boxes and perform actions based on these and other text.

I tried PyPDF3, but it only gave corrupted output. After a little research I found pdfminer, which sounds promising, with the downside that it requires Python 2.7.

I'm not sure whether there are other packages, or whether there is a best practice for working with PDFs in Python, as all the information I found is several years old and much of it is contradictory. Of course, I could settle for the best package for my case :)

Thanks for any advice!

Look here: https://stackoverflow.com/q/32667398/10300416 – Nick Dima Dec 26 '18 at 19:49

score 6 · Answer 1 · edited Dec 26 '22 at 07:59

First option: pypdf

First, run this in a terminal to install pypdf (it may work better than PyPDF3, which you already tried):

pip install pypdf

Then to extract text from a pdf file use the following code:

# importing required modules
import pypdf

# creating a pdf reader object
reader = pypdf.PdfReader("example.pdf")

# printing number of pages in pdf file
print(len(reader.pages))

# creating a page object
page = reader.pages[0]

# extracting text from page
print(page.extract_text())
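
Since the question also mentions check boxes: if they are real interactive form fields (an assumption; they could just as well be plain drawings or part of a scanned image, in which case this finds nothing), pypdf can list them with reader.get_fields(). The following is only a minimal sketch. The helper names checkbox_states and read_checkboxes are made up for illustration and are not part of pypdf, and the sketch assumes an unchecked box has the value "/Off" while any other value means checked. Button fields ("/Btn") also include radio buttons and push buttons, so you may need to filter further for your document.

```python
def checkbox_states(fields):
    """Map each button-type field name to True (checked) or False (unchecked).

    `fields` is the dict returned by pypdf's reader.get_fields(),
    which is None when the PDF has no interactive form.
    """
    states = {}
    for name, field in (fields or {}).items():
        # "/Btn" is the PDF field type for check boxes (and other buttons)
        if field.get("/FT") == "/Btn":
            value = field.get("/V")
            # Assumption: "/Off" or a missing value means unchecked
            states[name] = value is not None and str(value) != "/Off"
    return states


def read_checkboxes(path):
    """Open a PDF with pypdf and return its check box states."""
    # Imported here so checkbox_states() can be used without pypdf installed
    from pypdf import PdfReader

    reader = PdfReader(path)
    return checkbox_states(reader.get_fields())
```

You would then call read_checkboxes("example.pdf") and branch on the returned dictionary together with the extracted text.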

Second option: textract

Run this in a terminal to install textract:

pip install textract

Then to read a pdf use the following code:

import textract
# process() returns bytes; call .decode() on the result if you need a str
text = textract.process('path/to/pdf/file', method='pdfminer')

Good luck!

edited Dec 26 '22 at 07:59 by Martin Thoma
answered Dec 26 '18 at 21:10 by Gabriel Wolf

does extract support encrypted pdf? – Irshu Apr 14 '19 at 08:22
pypdf supports encrypted pdfs: https://pypdf2.readthedocs.io/en/latest/user/encryption-decryption.html – Martin Thoma Dec 26 '22 at 07:59
