Extracting images from pdf using Python

Question

How can we extract images (and only images) from a PDF?

I have tried many online tools, but none of them works universally. In most of the PDFs, it took the screenshot of the whole image instead of the image. PDF link -> sg.inflibnet.ac.in:8080/jspui/bitstream/10603/121661/9/09_chapter 4.pdf

I have used some websites: http://www.pdfaid.com/ExtractImages.aspx https://pdfcandy.com/extract-images.html https://www.pdf-online.com/osa/extract.aspx — Yash Sharma, May 30 '19 at 09:09
"whole image instead of the image" what do you mean by this? I would really recommend you post screenshots showing what you got, and clearly indicating what you wanted to get. — Ryan, Jun 06 '19 at 06:52

score 3 · Answer 1 · answered Jun 07 '19 at 20:29

Here's a solution with PyMuPDF:

#!python3.6
import fitz  # PyMuPDF


def get_pixmaps_in_pdf(pdf_filename):
    # Method names below are the snake_case ones used by current PyMuPDF;
    # older releases spelled them pageCount, getPageImageList and writePNG.
    doc = fitz.open(pdf_filename)
    xrefs = set()
    for page_index in range(doc.page_count):
        for image in doc.get_page_images(page_index):
            xrefs.add(image[0])  # Add XREFs to set so duplicates are ignored
    pixmaps = [fitz.Pixmap(doc, xref) for xref in xrefs]
    doc.close()
    return pixmaps


def write_pixmaps_to_pngs(pixmaps):
    for i, pixmap in enumerate(pixmaps):
        pixmap.save(f'{i}.png')  # Might want to come up with a better name


pixmaps = get_pixmaps_in_pdf(r'C:\StackOverflow\09_chapter 4.pdf')
write_pixmaps_to_pngs(pixmaps)
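
On the "better name" comment above: a small sketch that builds the PNG name from the PDF's file name and the image's XREF, so images from different PDFs don't overwrite each other. `image_filename` is a made-up helper (not part of PyMuPDF), and you would have to return `(xref, pixmap)` pairs from `get_pixmaps_in_pdf` to use it.

```python
from pathlib import Path


def image_filename(pdf_filename, xref, ext='png'):
    """Hypothetical naming scheme: '<pdf name>_xref<number>.<ext>'."""
    # Path() splits on the separators of the OS the script runs on.
    stem = Path(pdf_filename).stem
    return f'{stem}_xref{xref:05d}.{ext}'


print(image_filename('pdfs/some.pdf', 12))  # some_xref00012.png
```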

score 2 · Answer 2 · answered May 30 '19 at 09:07

Here is some code that reads a PDF file using pyPdf, extracts the images, and yields them as PIL.Image objects. You need to adapt it to your needs; it's just here to demonstrate how to walk the object tree.

import io
import pyPdf
import PIL.Image


def extract_images(infile_name):
    """Yield every image found in the PDF as a PIL.Image."""
    with open(infile_name, 'rb') as in_f:
        in_pdf = pyPdf.PdfFileReader(in_f)
        for page_no in range(in_pdf.getNumPages()):
            page = in_pdf.getPage(page_no)

            # Images are part of a page's `/Resources/XObject`
            r = page['/Resources']
            if '/XObject' not in r:
                continue
            for k, v in r['/XObject'].items():
                vobj = v.getObject()
                # We are only interested in images...
                if vobj['/Subtype'] != '/Image' or '/Filter' not in vobj:
                    continue
                if vobj['/Filter'] == '/FlateDecode':
                    # A raw bitmap
                    buf = vobj.getData()
                    # Notice that we need metadata from the object
                    # so we can make sense of the image data
                    size = tuple(map(int, (vobj['/Width'], vobj['/Height'])))
                    # 'RGB' assumes 8-bit RGB samples; other colour spaces
                    # need a different mode
                    img = PIL.Image.frombytes('RGB', size, buf,
                                              decoder_name='raw')
                    yield img
                elif vobj['/Filter'] == '/DCTDecode':
                    # A compressed image (JPEG)
                    img = PIL.Image.open(io.BytesIO(vobj._data))
                    yield img


for img in extract_images('my.pdf'):
    # Do something with `img`...
    print(img.size)
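
Note that the `'RGB'` mode in the `/FlateDecode` branch is hard-coded, so grayscale or CMYK images will raise or come out garbled. A minimal sketch of picking the Pillow mode from the image dictionary instead, assuming 8 bits per component and a plain device colour space (`pil_mode_for` is a made-up helper name, not part of pyPdf or Pillow):

```python
# Hypothetical helper: map a PDF image's /ColorSpace to a Pillow mode.
# Only the simple 8-bit device colour spaces are handled; anything else
# (indexed, ICC-based, 1-bit scans, ...) returns None so the caller can skip it.
_MODES = {
    '/DeviceGray': ('L', 1),
    '/DeviceRGB': ('RGB', 3),
    '/DeviceCMYK': ('CMYK', 4),
}


def pil_mode_for(colorspace, bits_per_component, width, height):
    """Return (mode, expected_byte_count), or None if unsupported."""
    if not isinstance(colorspace, str):
        return None  # e.g. an array such as [/ICCBased ...]
    if bits_per_component != 8 or colorspace not in _MODES:
        return None
    mode, channels = _MODES[colorspace]
    return mode, width * height * channels


print(pil_mode_for('/DeviceGray', 8, 4, 2))  # ('L', 8)
```

In the loop you could call it with `vobj.get('/ColorSpace')`, `vobj.get('/BitsPerComponent')` and the size, then compare the byte count with `len(buf)` before handing the buffer to `frombytes`.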

Welcome to StackOverflow. Please frame your questions and comments with as much information as possible. "It does not work" is in no way helpful. Update your question to state what precisely you've tried and what "it does not work" mean. — user2722968, May 30 '19 at 09:14
It works. I have some installation issues with pyPdf. I used PyPDF2 instead of pyPdf and replaced "yield" parts with img.save(..). (Win x64 - Python 3.8) — M.Selman SEZGİN, Jan 19 '20 at 13:23

score 1 · Answer 3 · answered Aug 22 '22 at 05:16

Other solutions didn't work for me, so here's my solution:

Install PyMuPDF with:

pip install pymupdf

Create and run the following script. It assumes that the PDF is stored in a pdfs directory and that the extracted images should be written to an existing images directory, both inside the current directory.

#!/usr/bin/env python3

import fitz

doc = fitz.open('pdfs/some.pdf')

image_xrefs = {}

for page in doc:
    for image in page.get_images():
        image_xrefs.setdefault(image[0])

for index, xref in enumerate(image_xrefs):
    img = doc.extract_image(xref)
    if img:
        with open(f'images/{index}.{img["ext"]}', 'wb') as image:
            image.write(img['image'])
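
If you don't want to create the images directory by hand, or you want to skip tiny filler images, the write step can be pulled into a helper. This is only a sketch: `save_image` and its `min_side` threshold are my own assumptions, and it relies on the `width`, `height`, `ext` and `image` keys of the dict returned by `doc.extract_image(xref)`.

```python
from pathlib import Path


def save_image(img, index, out_dir='images', min_side=2):
    """Write one extract_image()-style dict to disk; return the path or None."""
    if not img or min(img['width'], img['height']) < min_side:
        return None  # nothing extracted, or a 1x1-style filler image
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the directory on demand
    path = out / f'{index}.{img["ext"]}'
    path.write_bytes(img['image'])
    return path
```

The second loop of the script would then shrink to `save_image(doc.extract_image(xref), index)`.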

Answer 4 · K J · answered Aug 24 '22 at 11:25

Not all PDFs are simply text plus images. The file in this question is a hybrid, as can be seen when the area around the figure is selected. The hint is that the file mentions Adobe Paper Capture, so it was scanned and OCRed, and not all of the text was captured. The OP expected the figure to be extractable from within the whole page (this follows on from their previous comment):

"Moreover I also want to extract images from a section if there are any images there." "it took the screenshot of the whole image instead of the image."

Hsps on the cellw ar surface Dead cells were gated by staining with propidium iodide.
(a) Control
(b) Experimental
Fig. 2a. Flow cytometric analysis of expression of GroEL on the surface of vegetative cells of B.

Listing the images on that page with a tool such as pdfimages shows more junk entries than valid ones:

pdfimages  -list -f 12 -l 12 -verbose "09_chapter 4.pdf" -
[processing page 12]
--0000.pbm: page=12 width=2412 height=3436 hdpi=300.00 vdpi=300.00 colorspace=DeviceGray bpc=1
--0001.pbm: page=12 width=1 height=1 hdpi=0.44 vdpi=2.03 mask bpc=1
--0002.pbm: page=12 width=1 height=1 hdpi=0.53 vdpi=2.59 mask bpc=1
--0003.pbm: page=12 width=1 height=1 hdpi=0.49 vdpi=2.27 mask bpc=1

Extracting the images will therefore simply give you the scanned page plus three files that are each a single 1x1 pixel dot. Only one of the four outputs (25 %) is a usable image, and it is the whole scanned page, not the source diagram/figure the OP expected.
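
To weed out such entries automatically, you could parse the listing and keep only images above a minimum size. A rough sketch, assuming the output looks exactly like the lines shown above (`--name: key=value ...`); `parse_listing`, `real_images` and the `min_side` threshold are names I made up for illustration:

```python
import re

# One listing entry: "--0000.pbm: page=12 width=2412 ..."
ENTRY = re.compile(r'^--(?P<name>\S+): (?P<fields>.*)$')

LISTING = """\
[processing page 12]
--0000.pbm: page=12 width=2412 height=3436 hdpi=300.00 vdpi=300.00 colorspace=DeviceGray bpc=1
--0001.pbm: page=12 width=1 height=1 hdpi=0.44 vdpi=2.03 mask bpc=1
--0002.pbm: page=12 width=1 height=1 hdpi=0.53 vdpi=2.59 mask bpc=1
--0003.pbm: page=12 width=1 height=1 hdpi=0.49 vdpi=2.27 mask bpc=1
"""


def parse_listing(text):
    """Turn each '--name: key=value ...' line into a dict."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if not match:
            continue  # e.g. the "[processing page 12]" line
        entry = {'name': match.group('name'), 'flags': []}
        for token in match.group('fields').split():
            if '=' in token:
                key, value = token.split('=', 1)
                entry[key] = value
            else:
                entry['flags'].append(token)  # bare words such as "mask"
        entries.append(entry)
    return entries


def real_images(entries, min_side=2):
    """Drop entries smaller than min_side pixels in either direction."""
    return [e for e in entries
            if int(e['width']) >= min_side and int(e['height']) >= min_side]


print([e['name'] for e in real_images(parse_listing(LISTING))])  # ['0000.pbm']
```

For this page that still leaves only the full-page scan, so the figure itself would have to be cropped out of the page image.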
