pyPdf extracting info from IndirectObject

Question

I am writing a script that reads the creation and modification dates of PDF files. I am using the pyPdf package in Python.

I have the following code:

from pyPdf import PdfFileWriter, PdfFileReader

input1 = PdfFileReader(file('myfile','rb'))

input1.getDocumentInfo()

This code returns:

{'/Producer': IndirectObject(185, 0), '/CreationDate': IndirectObject(186, 0), '/ModDate': IndirectObject(186, 0)}

I am not sure how to extract the information from these IndirectObject values. Any help would be appreciated!

score 2 · Answer 1 · edited May 12 '20 at 14:09

I don't really know pyPdf that well, but pdfrw has some similar functionality, and (IMHO -- I'm the author) a somewhat simpler interface. pdfrw maps structures in PDF files into Python structures. Here is an example session:

>>> from pdfrw import PdfReader
>>> x = PdfReader('some_random.pdf')

What is x? It's the trailer dictionary of the PDF file, which is mapped into a (subclassed) Python dictionary:

>>> list(x)
['/Size', '/Info', '/Root']

To access items in this dictionary, you could use dictionary-style lookup, but since all the standard Adobe names start with a slash and a letter, pdfrw supports attribute lookups as well for convenience. It's basically dictionaries and lists all the way down. Info is just another dictionary:

>>> x.Info
{'/ModDate': '(D:20130802052610)',
 '/Producer': '(ImageMagick 6.6.0-1 2010-03-04 Q8 http://www.imagemagick.org)',
 '/Title': '(US4441207.pdf)',
 '/CreationDate': '(D:20130802052610)'}

So you can pull out the Producer the same way:

>>> x.Info.Producer
'(ImageMagick 6.6.0-1 2010-03-04 Q8 http://www.imagemagick.org)'
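As a rough picture of how attribute lookup can sit on top of slash-prefixed keys, here is a toy dict subclass. It is not pdfrw's actual implementation: the class name SlashDict and its behaviour for missing keys are made up for illustration.

```python
class SlashDict(dict):
    """Toy dictionary: d.Name is shorthand for d['/Name'] (hypothetical class)."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails,
        # so regular dict methods keep working.
        try:
            return self['/' + name]
        except KeyError:
            raise AttributeError(name)


# Nested dictionaries mimic the trailer -> Info structure shown above.
trailer = SlashDict({
    '/Info': SlashDict({'/Title': '(US4441207.pdf)'}),
})

title = trailer.Info.Title  # same value as trailer['/Info']['/Title']
```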

The Producer's value is a PDF string -- that's what the parentheses are about. You can strip them with the decode() method:

>>> x.Info.Producer.decode()
'ImageMagick 6.6.0-1 2010-03-04 Q8 http://www.imagemagick.org'

Likewise with the CreationDate:

>>> x.Info.CreationDate.decode()
'D:20130802052610'
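Since the original goal was to read the dates, the decoded string can then be turned into a datetime with the standard library. This is a minimal sketch, not part of pdfrw: the helper name parse_pdf_date is made up, and it assumes the value carries the full 14-digit YYYYMMDDHHmmSS form (PDF dates may be shorter, and any timezone suffix is ignored here).

```python
from datetime import datetime


def parse_pdf_date(raw):
    """Convert a PDF date string such as 'D:20130802052610' to a datetime.

    Hypothetical helper: assumes all 14 date/time digits are present
    and ignores any timezone suffix that follows them.
    """
    text = raw.strip()
    # Tolerate the undecoded form, which is still wrapped in parentheses.
    if text.startswith('(') and text.endswith(')'):
        text = text[1:-1]
    # Drop the optional 'D:' prefix.
    if text.startswith('D:'):
        text = text[2:]
    return datetime.strptime(text[:14], '%Y%m%d%H%M%S')


created = parse_pdf_date('D:20130802052610')
```

With pdfrw, the argument would be the result of x.Info.CreationDate.decode() from the session above.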

The pdfrw documentation isn't really great, but there are a lot of examples on GitHub and a few here on SO.
