pdfparser from pdfminer: PDFException: PDFDocument is not initialized

Question

I don't understand this error. I want to open a PDF and loop over its pages, but I get this exception and couldn't find much by googling it.

Here is the example that fails

from pdfminer.pdfparser import PDFParser, PDFDocument
from os.path import basename, splitext

file = 'tmpfiles/tmpfile.pdf'
filename = splitext(basename(file))[0]
fp = open(file, 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
num_page = 0
text = ""
pages = doc.get_pages()
for p in pages:
    print("do whatever")

Here is the traceback

Traceback (most recent call last):
  File "test.py", line 20, in <module>
    for p in pages:
  File "/home/.../anaconda3/lib/python3.6/site-packages/pdfminer/pdfparser.py", line 544, in get_pages
    raise PDFException('PDFDocument is not initialized')
pdfminer.pdftypes.PDFException: PDFDocument is not initialized

I have python 3.6

Before doing this, I save the PDF file like this, because I have its contents as a base64-encoded string:

import base64

decoded = base64.b64decode(content_string)
with open(tmpfiles_path+'tmpfile.pdf', 'wb') as fout:
    fout.write(decoded)

Could it be that the file is being saved with some protection?
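One way to rule out a problem with the saved file itself is to check the decoded bytes before writing them. The sketch below is only a sanity check, not something from the original post: the function name `decode_pdf` is hypothetical, and it assumes the string is plain base64 with no prefix. It checks that the data starts with the usual `%PDF-` marker; it does not detect encryption or other protection.

```python
import base64


def decode_pdf(content_string):
    """Decode a base64 string and check that the result looks like a PDF.

    Hypothetical helper for illustration; assumes plain base64 input.
    """
    decoded = base64.b64decode(content_string)
    # PDF files normally begin with the marker "%PDF-" followed by a version.
    if not decoded.startswith(b"%PDF-"):
        raise ValueError("decoded data does not start with a PDF header")
    return decoded
```

If this raises, the problem is in the encoded string or the decoding step rather than in pdfminer.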

Judging by the docs (https://media.readthedocs.org/pdf/pdfminer-docs/latest/pdfminer-docs.pdf) it looks like there was a change in 2014 to make calling `initialize` unnecessary, but perhaps you're running an old version or something? Try calling `doc.initialize()` after `doc = PDFDocument(parser)` and see if that works? — Ashley Davies, Feb 08 '19 at 17:06
I did `doc = doc.initialize()` and now it says NoneType object has no attribute get_pages. If I only do `doc.initialize()` I get the same error as before. — Atirag, Feb 08 '19 at 17:10
Yes, I believe it's the latter -- initialize likely initializes the object, rather than creating a new one. Have you tried this with a PDF which definitely has no protections or strange attributes as a sanity-check? If not, try running it with something like https://cloud.google.com/files/CloudStorage.pdf? — Ashley Davies, Feb 08 '19 at 17:12
OK, apparently I had a weird version of pdfminer, because I just installed pdfminer.six and changed a few lines of code and now it works. Thanks for the help! — Atirag, Feb 08 '19 at 17:37
No problem! Consider posting your solution as an answer below and accepting it -- you get some points & it helps people who come across this post later — Ashley Davies, Feb 08 '19 at 18:02

score 2 · Accepted Answer · answered Feb 08 '19 at 18:26

The problem was the version of pdfminer I was using. I installed pdfminer.six and changed the code like this:

from pdfminer.pdfpage import PDFPage

file = 'tmpfiles/tmpfile.pdf'
# Keep the file open while iterating: get_pages() yields pages lazily
with open(file, 'rb') as fp:
    for p in PDFPage.get_pages(fp):
        print("do whatever")

Now it works.
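Since several distributions (for example pdfminer, pdfminer3k and pdfminer.six) install an importable package named `pdfminer`, the import line alone does not show which one is in the environment. A minimal sketch for checking, assuming Python 3.8 or later (`importlib.metadata` is not in the standard library on 3.6); the helper name `installed_version` is hypothetical, and which distribution I originally had is not something I checked at the time:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Report which pdfminer-style distributions are present, if any.
    for name in ("pdfminer", "pdfminer3k", "pdfminer.six"):
        print(name, "->", installed_version(name))
```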
