2

How can we extract images (only images) from a PDF?

I have tried many online tools, but none of them works universally. For most PDFs, they take a screenshot of the whole image instead of the image. PDF link: sg.inflibnet.ac.in:8080/jspui/bitstream/10603/121661/9/09_chapter 4.pdf

Yash Sharma
  • What have you tried so far? – Nick May 30 '19 at 08:20
  • I have used some websites: http://www.pdfaid.com/ExtractImages.aspx https://pdfcandy.com/extract-images.html https://www.pdf-online.com/osa/extract.aspx – Yash Sharma May 30 '19 at 09:09
  • "whole image instead of the image" what do you mean by this? I would really recommend you post screenshots showing what you got, and clearly indicating what you wanted to get. – Ryan Jun 06 '19 at 06:52

4 Answers

3

Here's a solution with PyMuPDF:

#!python3.6
import fitz  # PyMuPDF


def get_pixmaps_in_pdf(pdf_filename):
    doc = fitz.open(pdf_filename)
    xrefs = set()
    for page_index in range(doc.pageCount):
        for image in doc.getPageImageList(page_index):
            xrefs.add(image[0])  # Add XREFs to set so duplicates are ignored
    pixmaps = [fitz.Pixmap(doc, xref) for xref in xrefs]
    doc.close()
    return pixmaps


def write_pixmaps_to_pngs(pixmaps):
    for i, pixmap in enumerate(pixmaps):
        pixmap.writePNG(f'{i}.png')  # Might want to come up with a better name


pixmaps = get_pixmaps_in_pdf(r'C:\StackOverflow\09_chapter 4.pdf')
write_pixmaps_to_pngs(pixmaps)
J. Owens
2

Here is some code that reads a PDF file using pyPdf, extracts the images, and yields each one as a PIL.Image. You will need to adapt it to your needs; it is only here to demonstrate how to walk the object tree.

import io
import pyPdf
import PIL.Image

infile_name = 'my.pdf'


def iter_pdf_images(pdf_path):
    """Yield every image found in the pages' XObjects as a PIL.Image."""
    with open(pdf_path, 'rb') as in_f:
        in_pdf = pyPdf.PdfFileReader(in_f)
        for page_no in range(in_pdf.getNumPages()):
            page = in_pdf.getPage(page_no)

            # Images are part of a page's `/Resources/XObject`
            r = page['/Resources']
            if '/XObject' not in r:
                continue
            for k, v in r['/XObject'].items():
                vobj = v.getObject()
                # We are only interested in images...
                if vobj['/Subtype'] != '/Image' or '/Filter' not in vobj:
                    continue
                if vobj['/Filter'] == '/FlateDecode':
                    # A raw bitmap
                    buf = vobj.getData()
                    # Notice that we need metadata from the object
                    # so we can make sense of the image data
                    size = tuple(map(int, (vobj['/Width'], vobj['/Height'])))
                    # 'RGB' assumes 8-bit RGB samples; other colour
                    # spaces need a different mode
                    img = PIL.Image.frombytes('RGB', size, buf,
                                              decoder_name='raw')
                    yield img
                elif vobj['/Filter'] == '/DCTDecode':
                    # A compressed (JPEG) image
                    img = PIL.Image.open(io.BytesIO(vobj._data))
                    yield img


# The generator has to be consumed; do something with each `img` here
for img in iter_pdf_images(infile_name):
    print(img.size)
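
The `'RGB'` mode passed to `frombytes` is an assumption about the image; the stream dictionary says what the bytes actually are. As a rough sketch (the helper name `raw_layout` is hypothetical, and only the three device colour spaces are covered), the PIL mode and the expected buffer length can be derived from `/ColorSpace`, `/BitsPerComponent`, `/Width` and `/Height` before decoding:

```python
# Components per pixel and the matching PIL mode for the three device
# colour spaces. Indexed, ICCBased and other spaces are not handled here.
_DEVICE_SPACES = {
    '/DeviceGray': (1, 'L'),
    '/DeviceRGB': (3, 'RGB'),
    '/DeviceCMYK': (4, 'CMYK'),
}


def raw_layout(colorspace, bits_per_component, width, height):
    """Return (pil_mode, expected_byte_count) for an uncompressed bitmap.

    Returns None when the colour space or bit depth is outside this
    sketch, so the caller can skip the image instead of guessing.
    """
    # Array-valued colour spaces (e.g. ICCBased) are not plain names.
    if not isinstance(colorspace, str) or colorspace not in _DEVICE_SPACES:
        return None
    components, mode = _DEVICE_SPACES[colorspace]
    if bits_per_component == 1 and components == 1:
        mode = '1'  # bilevel data, eight pixels per byte
    elif bits_per_component != 8:
        return None
    # Each row of samples is padded to a whole number of bytes.
    row_bytes = (width * components * bits_per_component + 7) // 8
    return mode, row_bytes * height
```

If the length of `buf` does not match the expected byte count, the stream is probably not a plain bitmap in that layout, and skipping it is safer than decoding it with a guessed mode.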
user2722968
  • I have tried the same logic before, but it didn't work. – Yash Sharma May 30 '19 at 09:13
  • Welcome to StackOverflow. Please frame your questions and comments with as much information as possible. "It does not work" is in no way helpful. Update your question to state what precisely you've tried and what "it does not work" means. – user2722968 May 30 '19 at 09:14
  • It works. I have some installation issues with pyPdf. I used PyPDF2 instead of pyPdf and replaced "yield" parts with img.save(..). (Win x64 - Python 3.8) – M.Selman SEZGİN Jan 19 '20 at 13:23
1

Other solutions didn't work for me, so here's my solution:

Install PyMuPDF with:

pip install pymupdf

Create and run the following script. It assumes that the PDF is stored in the pdfs directory and that the extracted images should be written to the images directory (which must already exist); both are inside the current directory.

#!/usr/bin/env python3

import fitz

doc = fitz.open('pdfs/some.pdf')

image_xrefs = {}

for page in doc:
    for image in page.get_images():
        image_xrefs.setdefault(image[0])

for index, xref in enumerate(image_xrefs):
    img = doc.extract_image(xref)
    if img:
        with open(f'images/{index}.{img["ext"]}', 'wb') as image:
            image.write(img['image'])

rmalviya
0

Not all PDFs are simply text plus images. The file in this question is a hybrid of scanned page images and OCR text, as you can see when you select the area around the figure. The hint is that the file says Adobe Paper Capture, so it was OCRed, and not all of the text was captured. The OP expected the figure to be extractable from within the whole page (this follows on from their previous comment):

"Moreover I also want to extract images from a section if there are any images there." "it tools the screenshot of the whole image instead of the image."

[Figure: selecting the region around the figure returns only fragments of OCR text: the sentence "Dead cells were gated by staining with propidium iodide.", the panel labels "(a) Control" and "(b) Experimental", scattered numeric values, and the caption "Fig. 2a. Flow cytometric analysis of expression of GroEL on the surface of vegetative cells of B."]

Querying the page with any PDF image listing tool, such as pdfimages, shows that it has more junk entries than valid ones:

pdfimages  -list -f 12 -l 12 -verbose "09_chapter 4.pdf" -
[processing page 12]
--0000.pbm: page=12 width=2412 height=3436 hdpi=300.00 vdpi=300.00 colorspace=DeviceGray bpc=1
--0001.pbm: page=12 width=1 height=1 hdpi=0.44 vdpi=2.03 mask bpc=1
--0002.pbm: page=12 width=1 height=1 hdpi=0.53 vdpi=2.59 mask bpc=1
--0003.pbm: page=12 width=1 height=1 hdpi=0.49 vdpi=2.27 mask bpc=1

Extracting the images will therefore simply give you the scanned page plus three files that are each a single 1x1 pixel dot. The output looks as if only 25% was recovered (one usable file out of four), and it is not the source diagram/figure the OP expected.
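
To drop the 1x1 entries automatically, the listing can be parsed and filtered by size. This is a minimal sketch that assumes the `key=value` line format shown in the listing above; the function names and the 16-pixel threshold are illustrative choices, not part of pdfimages:

```python
import re

# Matches lines such as
# "--0001.pbm: page=12 width=1 height=1 hdpi=0.44 vdpi=2.03 mask bpc=1"
_ENTRY = re.compile(r'^--(?P<name>\S+):\s+(?P<fields>.*)$')


def parse_listing(text):
    """Turn the listing into dicts holding the name and key=value fields."""
    entries = []
    for line in text.splitlines():
        match = _ENTRY.match(line.strip())
        if not match:
            continue  # e.g. "[processing page 12]"
        entry = {'name': match.group('name')}
        for token in match.group('fields').split():
            key, sep, value = token.partition('=')
            # Bare flags such as "mask" have no value; store them as True.
            entry[key] = value if sep else True
        entries.append(entry)
    return entries


def real_images(entries, min_side=16):
    """Keep entries whose width and height both reach min_side pixels."""
    return [e for e in entries
            if int(e['width']) >= min_side and int(e['height']) >= min_side]


def page_size_inches(entry):
    """Physical size implied by the pixel dimensions and the resolution."""
    return (int(entry['width']) / float(entry['hdpi']),
            int(entry['height']) / float(entry['vdpi']))
```

For the page 12 listing, only 0000.pbm survives the filter, and its implied size is 2412 / 300 = 8.04 in by 3436 / 300 ≈ 11.45 in, which is a whole scanned page rather than an individual figure.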


K J