
I am trying to extract images from a PDF document using this specific library: pypdfium2 (https://pypi.org/project/pypdfium2/).

I would love to use PyMuPDF instead (given its excellent speed and versatility), but because it uses a copyleft license I CANNOT use it for my workflow. So please don't provide an answer that advises me to use PyMuPDF.

Any suggestions are appreciated. I've looked through the docs but can't seem to find any image extraction methods.

To be clear, I am not trying to convert the PDF pages into images; I am trying to extract the images embedded within the document itself (assuming there are any). Images are typically embedded as either JPEGs or PNGs.

3 Answers


PDF generally stores images in one of two ways. One is to take the raw image and embed it. Those are usually JPEGs and tend to use a single type of compression. There are several variants, such as inline and indirect, but the point is that they are stored "as inserted".

Thus they will not change compression or quality unless they are extracted, recompressed and re-inserted. A question many people ask is why PDF images can't be compressed in place. It is possible, but tricky.

The other way is that the RGB, grey or mono components are inserted as bitmaps (of one type or another), and for PNGs (or other images with alpha transparency) a second image is added as a soft mask. So there are now two images per insertion. These are even harder to handle.

So easy FOSS solutions are hard to come by.

pdfimages -list will give you clues about some of the structures and extract what it can (not all),

e.g.

--0000.ppm: page=1 width=1800 height=682 hdpi=599.67 vdpi=599.12 colorspace=DeviceRGB bpc=8
--0001.ppm: page=3 width=1834 height=665 hdpi=345.93 vdpi=345.75 colorspace=DeviceRGB bpc=8

So what images are those? The first has 22 colours of near black and near white, so it is greyscale but almost monochrome in nature, and could be converted externally to 600 dpi black and white.

The second is a screenshot from Amazon showing an iPhone, so it has a high proportion of orange and black with some red and blue too. That one can be converted to a JPEG or a PNG (without alpha) at 346 dpi, whichever you wish.

And so on. In this case the majority are better candidates for lossless PNG; only that second one would best be output as a JPEG.

Basically, reversing PDF raw image inputs is not simple, because you have to decide what to output.
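As a rough illustration of that decision, here is a minimal sketch that picks an output format from the number of distinct colours in an image. The function name, the pixel-tuple input and the 256-colour threshold are all assumptions made for this example, not something taken from any PDF tool.

```python
def suggest_format(pixels, max_flat_colours=256):
    """Suggest "png" or "jpeg" for a sequence of pixel tuples.

    Hypothetical heuristic: images with few distinct colours (line art,
    near-monochrome scans) tend to suit lossless PNG, while images with
    many colours (photos, rich screenshots) tend to suit JPEG.
    The threshold is an arbitrary assumption.
    """
    distinct = len(set(pixels))
    return "png" if distinct <= max_flat_colours else "jpeg"


# A near-monochrome page: only two distinct colours.
flat_page = [(0, 0, 0), (255, 255, 255)] * 100
# A colour-rich image: 1024 distinct colours.
rich_image = [(r * 8, g * 8, 0) for r in range(32) for g in range(32)]

print(suggest_format(flat_page))
print(suggest_format(rich_image))
```

A real decision would also look at things like the original filter and whether a soft mask is present, so treat this only as a starting point.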

Untested, but try $ pypdfium2 extract-images --help to see what its built-in options are (I understand from the docs that --render should help).

K J

I'm the author of pypdfium2 and found this thread by chance. Yes, this is possible, and also documented. Take a look at PdfPage.get_objects() and PdfImage.extract() (or PdfImage.get_bitmap()).

There's also a built-in CLI, pypdfium2 extract-images, as a testing utility. Its implementation demonstrates how to use the above APIs.
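For readers who want a starting point, here is a minimal sketch of how those APIs might be combined. It assumes the pypdfium2 version 4 API, where PdfPage.get_objects() accepts a filter list and PdfImage.extract() takes a file path prefix and appends the extension itself. The helper names and the file naming scheme are made up for this example, and this is not the CLI's actual implementation.

```python
from pathlib import Path


def image_prefix(out_dir, page_number, image_number):
    """Build a path prefix that records which page an image came from.

    The naming scheme is a hypothetical choice for this sketch.
    """
    return Path(out_dir) / f"page{page_number:04d}_img{image_number:03d}"


def extract_images(pdf_path, out_dir):
    """Extract embedded images page by page; return the prefixes written."""
    # Imported here so image_prefix() works without pypdfium2 installed.
    import pypdfium2 as pdfium
    import pypdfium2.raw as pdfium_c

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []

    pdf = pdfium.PdfDocument(pdf_path)
    for page_index in range(len(pdf)):
        page = pdf[page_index]
        # Restrict the page objects to images only.
        images = page.get_objects(filter=[pdfium_c.FPDF_PAGEOBJ_IMAGE])
        for image_number, image in enumerate(images, start=1):
            prefix = image_prefix(out_dir, page_index + 1, image_number)
            # Assumption: extract() appends a file extension to the prefix.
            image.extract(str(prefix))
            written.append(prefix)
    return written
```

Calling extract_images("doc.pdf", "out") would then write one file per image, with the 1-based page number in each name. Some images may fail to extract, so production code would likely want error handling around the extract() call.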

However, due to limitations in pdfium's public interface, pypdfium2 is nowhere near as good at image extraction as would technically be possible. You may want to consider pikepdf (MPL2-licensed); it's the best and most sophisticated tool for this task IMO.

(BTW, it's better to ask such questions on pypdfium2's discussions page on GitHub, where I'm much more likely to respond.)
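A minimal sketch of the pikepdf route, keeping the page number in each filename. It assumes pikepdf's Page.images mapping and PdfImage.extract_to(fileprefix=...); the helper names and naming scheme are hypothetical.

```python
from pathlib import Path


def page_image_prefix(out_dir, page_number, image_name):
    """Build a file prefix from the page number and the image's XObject name.

    image_name is expected to look like "/Im0"; the scheme is a
    hypothetical choice for this sketch.
    """
    clean_name = str(image_name).lstrip("/")
    return str(Path(out_dir) / f"page{page_number:04d}_{clean_name}")


def extract_with_pikepdf(pdf_path, out_dir):
    """Extract each page's images; return the file paths pikepdf reports."""
    # Imported here so page_image_prefix() works without pikepdf installed.
    from pikepdf import Pdf, PdfImage

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    with Pdf.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for name, raw_image in page.images.items():
                prefix = page_image_prefix(out_dir, page_number, name)
                # extract_to() chooses the file extension itself.
                written.append(PdfImage(raw_image).extract_to(fileprefix=prefix))
    return written
```

Not every image can be extracted without re-encoding, so expect some images to need special handling.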

mara004
  • See also a previous answer of mine on this topic: https://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python/75188677#75188677 – mara004 May 07 '23 at 12:45

You can use pdfimages, a command-line tool (Linux).
It is efficient, supports six image formats and can convert all of them to PNG if you need uniformity.

user3435121
  • Thanks for providing the workaround, but this option won't work for my use case because I need the page number each image came from, hence the request for options using pypdfium2. – americanthinker Apr 19 '23 at 14:03
  • @americanthinker if you add -p to the command, every filename will contain the page number and the image number. Example: pdfimages -p -png doc will generate files with names "doc-1-12.png", "doc-2-14.png", etc – user3435121 Apr 19 '23 at 22:04
  • pdfimages belongs to poppler, so theoretically the GPL would apply, which the OP doesn't want. Apart from that, a library is considerably more elegant than an external program. – mara004 May 07 '23 at 17:30