
I'm trying to read an example PDF with pdfrw. The PDF contains the phrase "Hello Matthew" near the bottom-left corner, at coordinates (100, 100). When I try to output the text (if that is even possible), I get an encoded stream of data instead, and I can't figure out how to turn it into text.

>>> import pdfrw

>>> file_object = pdfrw.PdfReader("Hello.pdf")
>>> file_object
{'/ID': ['<f643bc0910dfb67725d53e11054f4609>', '<f643bc0910dfb67725d53e11054f4609>'], '/Info': (5, 0), '/Root': {'/Outl
ines': (8, 0), '/PageMode': '/UseNone', '/Pages': {'/Count': '1', '/Kids': [{'/Contents': (7, 0), '/MediaBox': ['0', '0
', '595.2756', '841.8898'], '/Parent': {...}, '/Resources': {'/Font': (1, 0), '/ProcSet': ['/PDF', '/Text', '/ImageB',
'/ImageC', '/ImageI']}, '/Rotate': '0', '/Trans': {}, '/Type': '/Page'}], '/Type': '/Pages'}, '/Type': '/Catalog'}, '/S
ize': '9'}

>>> file_object.pages[0]
{'/Contents': (7, 0), '/MediaBox': ['0', '0', '595.2756', '841.8898'], '/Parent': {'/Count': '1', '/Kids': [{...}], '/T
ype': '/Pages'}, '/Resources': {'/Font': (1, 0), '/ProcSet': ['/PDF', '/Text', '/ImageB', '/ImageC', '/ImageI']}, '/Rot
ate': '0', '/Trans': {}, '/Type': '/Page'}

>>> file_object.pages[0].keys()
['/Contents', '/MediaBox', '/Parent', '/Resources', '/Rotate', '/Trans', '/Type']

>>> file_object.pages[0].Contents
{'/Filter': ['/ASCII85Decode', '/FlateDecode'], '/Length': '102'}

>>> file_object.pages[0].Contents.stream
'GapQh0E=F,0U\\H3T\\pNYT^QKk?tc>IP,;W#U1^23ihPEM_?CW4KISi90EC-p>QkRte=<%V"lI7]P)Rn29neZ[Kb,htEWn&q7Q2"V~>'
Matthew
2 Answers


That stream is compressed. You can tell from the /Filter entry in the stream dictionary: the data was Flate (zlib) compressed and then ASCII85-encoded.

Unfortunately, pdfrw does not (yet?) know how to decompress with that type of filter. If you run your pdf through something like pdftk first to decompress it, you might see something more reasonable.

Disclaimer: I am the primary pdfrw author.

But...

Even then, especially for non-ASCII fonts, character to glyph mapping in PDFs is complicated, so you won't always see something that looks reasonable.

If you really want to deeply examine text PDF files, pdfminer might be more useful -- pdfrw has not yet really grown the tools to do that too well.

Patrick Maupin
  • I just came to this conclusion when I noticed it was compressed! Totally agree, non-ASCII fonts could be chaos. My end result is I'm trying to modify the tags of a PDF and change them from `h1` to `h4` or add annotations. I don't think that's possible after I read over your source for `pdfrw` though, right? What would you suggest Patrick? (btw, I saw you were online 30 minutes ago and appreciate your quick response and thorough explanation!) – Matthew Mar 30 '17 at 19:47
  • Actually, annotations are outside the content stream, so you can do a lot of modification of annotations without modifying (or even decompressing) the content stream. pdfrw can be useful for this, but I haven't written any annotation-specific code. – Patrick Maupin Mar 30 '17 at 19:51
  • Hmm. I'll play with it more tomorrow morning and see if I can build out support in a PR on GH for annotations if it isn't too hard. – Matthew Mar 30 '17 at 19:54
  • That would be awesome, but it's really hard to figure out what's small and general purpose enough. PRs for example programs are always welcome, but I'm trying to curate the core library to keep it from collapsing under its own weight as pyPdf2 seems to have done. It literally took me years to add the pagemerge module... – Patrick Maupin Mar 30 '17 at 20:01
  • Have you managed to demonstrate how you could add or edit annotations? (being outside the compressed content stream sounds very useful). – Soferio Apr 16 '18 at 11:51

This works if your only filter is /FlateDecode, or if you can find an ASCII85Decode implementation to run first (the filters must be applied in the order listed). I have been using pdfrw.uncompress.uncompress([page.Contents]) to decode /FlateDecode streams. This is not the same as PdfReader.uncompress(): that method does not pass a stream to the processing function, it gives it all of the indirect_objects.

>>> pdf = pdfrw.PdfReader('foo.pdf')
>>> pages = pdf.Root.Pages.Kids
>>> p1 = pages[0]
>>> p1.Contents
{'/Filter': '/FlateDecode', '/Length': '13679'}
>>> p1.Contents.stream[:30]
'x\x9cÕ}Ý\x92æ¶\x91å½"ô\x0eu5Q߬ëk\x02üßP8BRwË'
>>> pdfrw.uncompress.uncompress([p1.Contents]) # Contents object/s in a list.
True # it returns True even if the stream is not decoded.
>>> p1.Contents.stream[:30]
'/Artifact <</Attached [/Top]/T' # ready for parsing
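
For a stream like the one in the question, whose /Filter is ['/ASCII85Decode', '/FlateDecode'], the two filters can probably be undone with the Python standard library alone. The following is a minimal sketch, not something pdfrw provides: the function name `decode_ascii85_flate` is made up for illustration, and it assumes the stream carries no /DecodeParms (for example, no predictor).

```python
import base64
import zlib


def decode_ascii85_flate(raw):
    """Undo /ASCII85Decode first, then /FlateDecode (hypothetical helper)."""
    # pdfrw hands the stream back as a str; convert it to bytes byte-for-byte.
    if isinstance(raw, str):
        raw = raw.encode("latin-1")
    raw = raw.strip()
    # PDF streams end with the '~>' end-of-data marker; '<~' is optional.
    if raw.startswith(b"<~"):
        raw = raw[2:]
    if raw.endswith(b"~>"):
        raw = raw[:-2]
    # Filters are applied in the order they are listed in /Filter.
    return zlib.decompress(base64.a85decode(raw))
```

With pdfrw this would be called as `decode_ascii85_flate(page.Contents.stream)`; the result is a bytes object holding the page's content operators.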

Then search for lines ending in either 'TJ' or 'Tj' and take any values inside round brackets from those lines... and you have your text.
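
As a rough sketch of that search, assuming the decoded stream is a str with one operator per line (which is not guaranteed for every PDF producer): `extract_text` and `STRING_RE` are names invented here, octal escapes and nested unescaped parentheses are not handled, and the result is only readable when the font uses a simple encoding.

```python
import re

# A PDF literal string: '(' ... ')', allowing backslash-escaped characters.
STRING_RE = re.compile(r"\((?:\\.|[^\\()])*\)")


def extract_text(content):
    """Return one string per line that ends in the Tj or TJ operator."""
    texts = []
    for line in content.splitlines():
        line = line.strip()
        if not line.endswith(("Tj", "TJ")):
            continue
        # A TJ array may hold several strings separated by kerning numbers.
        parts = [m.group(0)[1:-1] for m in STRING_RE.finditer(line)]
        # Undo the escapes \( \) and \\ ; other escapes are left untouched.
        texts.append(re.sub(r"\\([()\\])", r"\1", "".join(parts)))
    return texts
```

For example, `extract_text("(Hello Matthew) Tj")` returns a list containing the single string `Hello Matthew`.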

If you need location information for the text, find the blocks of lines between BT and ET and check the line endings. A line ending in Tm is preceded by six numbers, typically 1 0 0 1 x y; the last two give the starting position of the text (the bottom left corner of the first character).
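
A sketch of that idea follows, with the same caveats: it assumes one operator per line, only looks at Tm (not at Td, TD or T*, which also move the text position), and leaves escape sequences as they are. The name `text_positions` is hypothetical.

```python
import re

# A PDF literal string: '(' ... ')', allowing backslash-escaped characters.
STRING_RE = re.compile(r"\((?:\\.|[^\\()])*\)")


def text_positions(content):
    """Return ((x, y), text) pairs for Tj/TJ lines inside BT ... ET blocks."""
    results = []
    in_text = False
    position = None  # Stays None until a Tm line is seen in the block.
    for line in content.splitlines():
        line = line.strip()
        if line == "BT":
            in_text, position = True, None
            continue
        if line == "ET":
            in_text = False
            continue
        if not in_text:
            continue
        tokens = line.split()
        if len(tokens) >= 7 and tokens[-1] == "Tm":
            # The six operands are a b c d x y; x and y are the last two.
            position = (float(tokens[-3]), float(tokens[-2]))
        elif line.endswith(("Tj", "TJ")):
            text = "".join(m.group(0)[1:-1] for m in STRING_RE.finditer(line))
            results.append((position, text))
    return results
```

The recorded position is the one set by the most recent Tm in the same block, so text that follows a Td or T* move will be reported at a stale position.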

Eddie