
I'm trying to extract images from a PDF file using pdfminer.six.

There doesn't seem to be any documentation about how to do this with Python.

This is what I have so far:

import os
import pdfminer

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

os.chdir('C:\\Users\\zone_\\Desktop')
diretorio = os.getcwd()
file = str(diretorio) + '\\example.pdf'

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

This extracts the text, but how do I retrieve the images in the PDF?

unstuck

2 Answers


Recent versions of pdfminer.six provide the pdfminer.image.ImageWriter class, which handles serializing images to disk. You can use it like this:

from pdfminer.high_level import extract_pages
from pdfminer.image import ImageWriter
from pdfminer.layout import LTContainer, LTImage, LTPage

pages = list(extract_pages('document.pdf'))
page = pages[0]  # first page only; loop over `pages` to process them all


def get_images(layout_object):
    # Recursively yield every LTImage nested inside a layout object.
    # (Returning from inside the loop would stop at the first child.)
    if isinstance(layout_object, LTImage):
        yield layout_object
    elif isinstance(layout_object, LTContainer):
        for child in layout_object:
            yield from get_images(child)


def save_images_from_page(page: LTPage):
    # Write every image found on the page into 'output_dir'.
    iw = ImageWriter('output_dir')
    for image in get_images(page):
        iw.export_image(image)


save_images_from_page(page)

You can also use the pdf2txt.py command-line tool with its --output-dir option, as explained in the documentation.
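If the images are needed in memory rather than on disk, one option is to skip ImageWriter and read the stream attached to each image object. The following is only a sketch: `iter_image_bytes` and `images_in_pdf` are hypothetical helper names, and it assumes that image objects (pdfminer's LTImage) expose `.name` and `.stream.get_rawdata()` and that containers are iterable. The bytes returned are the raw, still-encoded stream contents, so depending on the PDF's filters they may need further decoding before they form a usable image file.

```python
from collections.abc import Iterable
from typing import Any, Iterator, Tuple


def iter_image_bytes(layout_objects: Iterable) -> Iterator[Tuple[str, Any]]:
    """Yield (name, raw_bytes) for every image found in a layout tree.

    Assumption: image objects carry `.name` and `.stream.get_rawdata()`
    (as pdfminer's LTImage does) and containers are iterable.
    """
    for obj in layout_objects:
        stream = getattr(obj, "stream", None)
        if stream is not None:
            # Image leaf: hand back the still-encoded stream bytes.
            yield obj.name, stream.get_rawdata()
        elif isinstance(obj, Iterable):
            # Container (page, figure, text box): descend into its children.
            yield from iter_image_bytes(obj)


def images_in_pdf(path: str) -> Iterator[Tuple[str, Any]]:
    """Yield (name, raw_bytes) for every image in the PDF at `path`."""
    # Imported lazily so the traversal above does not depend on pdfminer.
    from pdfminer.high_level import extract_pages

    for page in extract_pages(path):
        yield from iter_image_bytes(page)
```

Iterating over `images_in_pdf('document.pdf')` would then give one `(name, bytes)` pair per image, which can be wrapped in `io.BytesIO` or passed along without writing anything to disk.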

hellpanderr
  • Is it possible to get the file name + the byte stream of the image, but not write it directly to disk? – Martin Thoma Sep 25 '22 at 08:28
  • @MartinThoma I don't see how with the current ImageWriter class https://github.com/euske/pdfminer/blob/master/pdfminer/image.py#L62 – hellpanderr Sep 26 '22 at 05:28

I have never used pdfminer; however, I found this code and this document by Denis Papathanasiou explaining it, which might help, since pdfminer's documentation is not very thorough. The document covers an outdated version, but the code was recently updated.

If you are not required to use pdfminer, there are alternatives that might be easier, such as PyMuPDF, shown in this answer, which extracts all images in the PDF as PNG files.

Jofre
Alex L