
We get PDF files delivered to us daily and we need to get the images out. For example, I want to extract the image from this PDF file using Python. Most PDF files we get are multipage, and we want to export each embedded image to a separate file. Most contain JPEG files, but this one does not.

Object 5 is embedded as a zlib-compressed stream. I am pretty sure it is zlib-compressed because it is marked as FlateDecode and the stream starts with \x78\x9c, which is typical for zlib. You can see (part of) the hex dump here.
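The \x78\x9c observation can be checked a little more rigorously. The sketch below applies the two-byte header test from the zlib format: the low nibble of the first byte must be 8 (deflate), and the two bytes read as a big-endian number must be a multiple of 31. The function name looks_like_zlib_header is made up for illustration, and passing the check only suggests zlib data, it does not prove it.

```python
def looks_like_zlib_header(data: bytes) -> bool:
    """Return True if the first two bytes form a plausible zlib header."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    method_is_deflate = (cmf & 0x0F) == 8      # compression method 8 = deflate
    window_ok = (cmf >> 4) <= 7                # window size exponent is at most 7
    checksum_ok = ((cmf << 8) | flg) % 31 == 0 # header check bits
    return method_is_deflate and window_ok and checksum_ok


print(looks_like_zlib_header(b"\x78\x9c"))  # the bytes seen in the hex dump
```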

The question is: how do I decompress ('inflate') it and save the resulting file?

Thank you for sharing your wisdom.

DDecoene
  • Yes, we get dozens of pdf files per day with at least four pages. We need to automate the extraction, trust me. – DDecoene Mar 14 '17 at 18:18
  • Have you checked [this](http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python)? And googling gave me [this](http://mikelynchgames.com/software-development/using-wand-to-extract-pngs-from-pdfs/). – Sangbok Lee Mar 15 '17 at 00:24
  • Yes, when I try to use wand (the easiest option), my whole MacBook crashes for no apparent reason. Also, I'm not permitted to install extra libraries on the server where this will be deployed anyway. So I'm using [this code for now](https://gist.github.com/DDecoene/4e91449572a473b278ec887ce61238b5) to extract JPG files, but I don't know what to do with images whose data starts with "\x78\x9c". – DDecoene Mar 15 '17 at 12:24
  • Does the server have Inkscape? It can be run in command-line mode. – Patrick Maupin Mar 30 '17 at 21:01
  • No, it does not, and I cannot install it either :( – DDecoene Apr 05 '17 at 08:46
  • You should look at [PDFFigures2](https://github.com/allenai/pdffigures2). It is implemented in scala, however, there is an earlier version [PDFFigures](https://github.com/allenai/pdffigures) of the same software which is implemented in python. – Samyak Jain Nov 21 '17 at 08:29

1 Answer

I searched everywhere and tried many things but couldn't get it to work completely. I did manage to decompress the data like this:

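import zlib

# Read the whole PDF into memory as bytes.
with open("MDL1703140088.pdf", "rb") as f:
    pdf = f.read()

# These offsets were found by hand and are specific to this file.
image = zlib.decompress(pdf[640:69307])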

640 is the offset of the zlib header (b'x\x9c'), and 69307 is the offset where the stream ends: b'\nendstream\n', the end-of-stream marker from the PDF spec, is found there. Details are in the spec, and some helpful Q&A can be found here. Omitting the end position is allowed in this case, though, because decompress() seems to ignore trailing non-compressed data. You can check this with:

decomp = zlib.decompressobj()
image = decomp.decompress(pdf[640:])
print(decomp.unused_data)  # starts with b'\nendstream\n'
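Since hand-picked offsets like 640 will differ in every file, the same idea could be generalised by searching for the stream keyword and trying to inflate whatever follows it. This is only a naive sketch using the standard library: the helper name iter_flate_streams is hypothetical, it ignores the /Length entry and the object dictionary, and it simply skips anything that does not inflate cleanly (such as JPEG data).

```python
import re
import zlib

# "stream" followed by a line break, but not the tail of "endstream".
STREAM_START = re.compile(rb"(?<!end)stream\r?\n")


def iter_flate_streams(pdf: bytes):
    """Yield (offset, inflated_bytes) for every stream that zlib can inflate."""
    view = memoryview(pdf)  # slicing a memoryview avoids copying the file
    for match in STREAM_START.finditer(pdf):
        start = match.end()
        decomp = zlib.decompressobj()
        try:
            data = decomp.decompress(view[start:])
        except zlib.error:
            continue  # not zlib data, e.g. a JPEG (DCTDecode) stream
        if decomp.eof:  # only keep streams that ended properly
            yield start, data
```

Called as list(iter_flate_streams(pdf)), this should return every inflatable stream in the file, including content streams and fonts, so the results would still need filtering down to image objects.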

So far so good. But when I write image to a PNG file, no image viewer can read it. The decompressed data actually looks quite empty in places. I tried prepending a PNG header, but no luck. Hey, it's too much...
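A likely explanation is that a FlateDecode image stream inflates to bare pixel samples rather than to a PNG file, so prepending a header is not enough: the rows have to be re-wrapped in PNG chunks. The sketch below does that with only zlib and struct. It assumes 8 bits per component, a DeviceGray or DeviceRGB colour space, and no predictor, and it assumes that width and height are read from the /Width and /Height entries of the image object's dictionary. The names raw_to_png and png_chunk are made up for illustration.

```python
import struct
import zlib


def png_chunk(tag: bytes, payload: bytes) -> bytes:
    """Build one PNG chunk: length, tag, payload, CRC over tag + payload."""
    crc = zlib.crc32(tag + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + tag + payload + struct.pack(">I", crc)


def raw_to_png(raw: bytes, width: int, height: int, channels: int = 3) -> bytes:
    """Wrap raw 8-bit samples (1 = gray, 3 = RGB) in a minimal PNG container."""
    color_type = {1: 0, 3: 2}[channels]  # PNG colour types: 0 gray, 2 truecolour
    stride = width * channels
    if len(raw) < stride * height:
        raise ValueError("not enough pixel data for the given dimensions")
    # Every PNG scanline is prefixed with a filter byte; 0 means "no filter".
    rows = b"".join(
        b"\x00" + raw[y * stride:(y + 1) * stride] for y in range(height)
    )
    ihdr = struct.pack(">IIBBBBB", width, height, 8, color_type, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + png_chunk(b"IHDR", ihdr)
        + png_chunk(b"IDAT", zlib.compress(rows))
        + png_chunk(b"IEND", b"")
    )
```

If those assumptions hold for the file, writing raw_to_png(image, width, height) to disk in binary mode should produce something a viewer can open. Images with a /DecodeParms predictor, a CMYK or indexed colour space, or a different bit depth would need extra handling.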

As I said earlier (strangely, my comment was removed by someone), you'd be better off using an existing tool. If Acrobat is not an option, what about pdftopng (part of Xpdf)? Running pdftopng MDL1703140088.pdf . (with "." as the output root) gave me a valid PNG file without any problems. Command-line tools can of course be run from Python, as you may know.
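Running the tool from Python could look roughly like the sketch below. It assumes pdftopng is installed and on the PATH, and it uses the same two-argument form as the command above (PDF file, then output root). The helper names are hypothetical.

```python
import subprocess


def build_pdftopng_command(pdf_path: str, png_root: str) -> list:
    """Return the argument list for: pdftopng <pdf> <output root>."""
    return ["pdftopng", str(pdf_path), str(png_root)]


def render_pages(pdf_path: str, png_root: str) -> None:
    """Run pdftopng and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_pdftopng_command(pdf_path, png_root), check=True)
```

For example, render_pages("MDL1703140088.pdf", ".") mirrors the command line used above. Note that pdftopng renders whole pages to PNG, so it yields one image per page rather than one file per embedded image.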

Sangbok Lee
  • Using pdftopng is a good idea, for sure. But (there is always a but, isn't there ;D) I cannot add libraries or tools to the server it will run on. The server is not ours :( – DDecoene Mar 15 '17 at 18:13