21

I need to extract all the images from a PDF file on my server. I don't want the PDF pages, only the images at their original size and resolution.

How could I do this with Perl, PHP or any other UNIX based app (which I would invoke with the exec function from PHP)?

brian d foy
Anil

3 Answers

24

pdfimages does just that. It is part of the poppler-utils and xpdf-utils packages.

From the manpage:

Pdfimages saves images from a Portable Document Format (PDF) file as Portable Pixmap (PPM), Portable Bitmap (PBM), or JPEG files.

Pdfimages reads the PDF file, PDF-file, scans one or more pages, and writes one PPM, PBM, or JPEG file for each image, image-root-nnn.xxx, where nnn is the image number and xxx is the image type (.ppm, .pbm, .jpg).

NB: pdfimages extracts the raw image data from the PDF file, without performing any additional transforms. Any rotation, clipping, color inversion, etc. done by the PDF content stream is ignored.
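For calling this from a script rather than by hand, here is a minimal sketch in Python. It assumes pdfimages is on the PATH. The function names, the output directory and the "image" prefix are made up for illustration; only the `-j` option and the `PDF-file image-root` argument order come from the manpage.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdfimages_command(pdf_path, image_root, keep_jpeg=True):
    """Return the argument list for one pdfimages call (hypothetical helper)."""
    cmd = ["pdfimages"]
    if keep_jpeg:
        # -j asks pdfimages to save DCT (JPEG) images as .jpg instead of .ppm
        cmd.append("-j")
    cmd += [str(pdf_path), str(image_root)]
    return cmd


def extract_images(pdf_path, out_dir, prefix="image"):
    """Run pdfimages and return the files it wrote, sorted by name."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found; install poppler-utils or xpdf-utils")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Output files are named <image-root>-nnn.<ext>, so the root includes the directory
    subprocess.run(build_pdfimages_command(pdf_path, out_dir / prefix), check=True)
    return sorted(out_dir.glob(prefix + "-*"))
```

The same argument list works with PHP's exec; passing the arguments as a list (rather than one shell string) avoids quoting problems with file names that contain spaces.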

brian d foy
Luis Melgratti
11

With regard to Perl, have you checked CPAN?

Kent Fredric
3

pdfimages is nice because it does not re-encode the images; it only extracts the JPEGs as they are. But there is a bug:

pdfimages comes from the "poppler-utils" package or from the bigger "xpdf-utils" package. On Ubuntu, at least, "poppler-utils" comes pre-installed. The pdfimages in poppler-utils 10.0.3 (Ubuntu 9.04 Jaunty) still ignores the "-j" option for extracting ".jpg" files. It always writes ".ppm".

As a workaround, you can replace "poppler-utils" with "xpdf-utils": $ sudo apt-get install xpdf-utils
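To see whether your installed pdfimages has this problem, you can look at the extensions of the files it wrote. Below is a rough sketch in Python; the helper names are hypothetical. Treat the result only as a hint, because a PDF with no JPEG-encoded images will also produce no ".jpg" files.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(filenames):
    """Count output files per lower-cased extension, e.g. '.ppm' or '.jpg'."""
    return Counter(Path(name).suffix.lower() for name in filenames)


def jpeg_option_ignored(filenames):
    """Return True when only .ppm output appears and no .jpg at all.

    This is a heuristic: it cannot tell a buggy pdfimages from a PDF
    that simply contains no DCT (JPEG) images.
    """
    counts = count_by_extension(filenames)
    return counts[".jpg"] == 0 and counts[".ppm"] > 0
```

If this returns True for a PDF that you know contains JPEG photos, switching packages as described above is worth a try.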

with kind regards,

+++ Oliver