How can I extract images from a PDF file?

Question

I need to extract all the images from a PDF file on my server. I don't want the PDF pages, only the images at their original size and resolution.

How could I do this with Perl, PHP, or any other UNIX-based application (which I would invoke with PHP's exec function)?

How do you know where each image is on the page? To the best of my knowledge, PDF files do not record this information. — j_random_hacker, Jan 10 '09 at 08:32

score 24 · Accepted Answer · edited Jan 10 '09 at 19:28

pdfimages does just that. It is part of the poppler-utils and xpdf-utils packages.

From the manpage:

Pdfimages saves images from a Portable Document Format (PDF) file as Portable Pixmap (PPM), Portable Bitmap (PBM), or JPEG files.

Pdfimages reads the PDF file PDF-file, scans one or more pages, and writes one PPM, PBM, or JPEG file for each image, image-root-nnn.xxx, where nnn is the image number and xxx is the image type (.ppm, .pbm, .jpg).

NB: pdfimages extracts the raw image data from the PDF file, without performing any additional transforms. Any rotation, clipping, color inversion, etc. done by the PDF content stream is ignored.
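
For illustration, here is a minimal sketch of calling pdfimages from a script. It is written in Python rather than PHP or Perl, and the helper names (build_pdfimages_command, extract_images) and the default root name "image" are assumptions made for this example, not part of pdfimages. It relies only on the command form described in the manpage excerpt (pdfimages [options] PDF-file image-root) and on the -j option for writing JPEG files.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdfimages_command(pdf_path, image_root, keep_jpeg=True):
    """Return the argument list for one pdfimages call (hypothetical helper)."""
    command = ["pdfimages"]
    if keep_jpeg:
        # -j asks pdfimages to write JPEG-encoded images as .jpg files
        command.append("-j")
    command += [str(pdf_path), str(image_root)]
    return command


def extract_images(pdf_path, output_dir, root_name="image"):
    """Run pdfimages and return the files it wrote, sorted by name."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages was not found on PATH")
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    command = build_pdfimages_command(pdf_path, output_dir / root_name)
    # check=True raises CalledProcessError if pdfimages exits with an error
    subprocess.run(command, check=True)
    # Output files are named image-root-nnn.xxx, so match on the root prefix
    return sorted(output_dir.glob(root_name + "-*"))
```

Passing the arguments as a list avoids shell quoting problems with unusual file names. The same argument order should apply when the command is built as a string for PHP's exec, provided the file names are escaped first.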

edited Jan 10 '09 at 19:28

brian d foy

answered Jan 10 '09 at 15:03

Luis Melgratti

I think the package gets installed when you install xpdf. – PolyThinker Jan 10 '09 at 15:22
that is correct too, both packages have pdfimages. – Luis Melgratti Jan 10 '09 at 15:26

score 11 · Answer 2 · answered Jan 10 '09 at 09:32

With regards to Perl, have you checked CPAN?

PDF::GetImages - get images from pdf document
PDF::OCR - get ocr and images out of a pdf file
PDF::OCR2 - extract all text and all image ocr from pdf

answered Jan 10 '09 at 09:32

Kent Fredric

score 3 · Answer 3 · answered Jan 22 '09 at 12:13

pdfimages is nice because it does not re-encode anything: it extracts the JPEGs as they are. But there is a bug:

pdfimages comes from the package "poppler-utils" or from the larger "xpdf-utils". On Ubuntu, at least, "poppler-utils" comes pre-installed. The pdfimages in poppler-utils 10.0.3 (Ubuntu 9.04 Jaunty) still ignores the "-j" option for extracting ".jpg" files. It always writes ".ppm".

As a workaround, you can replace "poppler-utils" with "xpdf-utils": $ sudo apt-get install xpdf-utils

with kind regards,

+++ Oliver
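
One rough way to check whether the "-j" option took effect on a given installation is to look at the extensions of the files pdfimages wrote. The sketch below is in Python, and the function names (count_by_extension, jpeg_option_ignored) are made up for this example. It assumes the output names follow the image-root-nnn.xxx pattern from the manpage. Note that a result of only ".ppm" or ".pbm" files is a hint, not proof: a PDF that contains no JPEG-encoded images would produce the same output even with a working "-j".

```python
from collections import Counter
from pathlib import Path


def count_by_extension(paths):
    """Count extracted files per lowercase extension, e.g. '.ppm' or '.jpg'."""
    return Counter(Path(p).suffix.lower() for p in paths)


def jpeg_option_ignored(paths):
    """Return True if files were written but none of them is a .jpg.

    This only suggests that -j had no effect; the PDF may simply
    contain no JPEG-encoded images.
    """
    counts = count_by_extension(paths)
    return counts[".jpg"] == 0 and (counts[".ppm"] + counts[".pbm"]) > 0
```

Running it on the output of a PDF that is known to contain JPEG images, such as a scanned document, gives the most meaningful answer.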

on my ubuntu server neither xpdf nor poppler recognizes the `-j` switch — mbx, May 23 '11 at 09:43
