I don't really know pyPdf that well, but pdfrw has some similar functionality and (IMHO, though I'm the author) a somewhat simpler interface. pdfrw maps the structures in a PDF file onto Python structures. Here is an example session:
>>> from pdfrw import PdfReader
>>> x = PdfReader('some_random.pdf')
What is x? It's the trailer dictionary of the PDF file, which is mapped into a (subclassed) Python dictionary:
>>> list(x)
['/Size', '/Info', '/Root']
To access items in this dictionary, you could use dictionary-style lookup, but since all the standard Adobe names start with a slash and a letter, pdfrw supports attribute lookups as well for convenience. It's basically dictionaries and lists all the way down. Info is just another dictionary:
>>> x.Info
{'/ModDate': '(D:20130802052610)',
'/Producer': '(ImageMagick 6.6.0-1 2010-03-04 Q8 http://www.imagemagick.org)',
'/Title': '(US4441207.pdf)',
'/CreationDate': '(D:20130802052610)'}
So you can pull out the Producer the same way:
>>> x.Info.Producer
'(ImageMagick 6.6.0-1 2010-03-04 Q8 http://www.imagemagick.org)'
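To make the attribute-versus-key equivalence concrete, here is a minimal stand-alone sketch of the idea. It is not pdfrw's actual implementation: the AttrDict class and the sample values are hypothetical, and raising AttributeError for a missing name is just a choice made for this sketch.

```python
class AttrDict(dict):
    """Hypothetical stand-in for a dict with attribute lookup (not pdfrw's class)."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict methods
        # such as .keys() keep working. Map x.Info onto x['/Info'].
        try:
            return self['/' + name]
        except KeyError:
            raise AttributeError(name)


# Illustrative data shaped like a trailer dictionary; the values are made up.
trailer = AttrDict({
    '/Size': 12,
    '/Info': AttrDict({'/Producer': '(ImageMagick 6.6.0-1)'}),
})

# Attribute-style and dictionary-style lookups reach the same object.
same = trailer.Info.Producer == trailer['/Info']['/Producer']
print(same)
```

The slash is added in one place, so nested lookups chain naturally as long as the nested dictionaries are of the same type.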
The Producer's value is a PDF string -- that's what the parentheses are about. You can strip them with the decode() method:
>>> x.Info.Producer.decode()
'ImageMagick 6.6.0-1 2010-03-04 Q8 http://www.imagemagick.org'
Likewise with the CreationDate:
>>> x.Info.CreationDate.decode()
'D:20130802052610'
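If you want an actual date rather than the raw text, the decoded value can be parsed with the standard library. This is a rough sketch that uses only datetime, not anything from pdfrw; parse_pdf_date is a hypothetical helper name, and it assumes the plain 'D:YYYYMMDDHHmmSS' form shown above. PDF dates may also carry a timezone suffix, which this sketch simply ignores.

```python
from datetime import datetime


def parse_pdf_date(text):
    """Parse a decoded PDF date of the form 'D:YYYYMMDDHHmmSS' (hypothetical helper)."""
    # Drop the optional 'D:' prefix.
    if text.startswith('D:'):
        text = text[2:]
    # Use only the first 14 digits; any trailing timezone part is ignored here.
    return datetime.strptime(text[:14], '%Y%m%d%H%M%S')


created = parse_pdf_date('D:20130802052610')
print(created.isoformat())  # 2013-08-02T05:26:10
```

In a session like the one above, you would pass in the result of x.Info.CreationDate.decode().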
The pdfrw documentation isn't great, but there are a lot of examples on GitHub and a few here on SO.