Python pdfminer extract image produces multiple images per page (should be single image)

Question

I am attempting to extract images that are in a PDF. The file I am working with is 2+ pages. Page 1 is text and pages 2-n are images (one per page, or it may be a single image spanning multiple pages; I do not have control over the origin).

I am able to parse the text out from page 1, but when I try to get the images I get 3 images per image page. I cannot determine the image type, which makes saving the images difficult. Additionally, saving each page's 3 pictures as a single image produces a file that cannot be opened (for example, via Finder on OS X).

Sample:

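from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTFigure, LTImage
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser


def find_images_in_thing(outer_layout):
    # Look one level inside a figure for images.
    for thing in outer_layout:
        if isinstance(thing, LTImage):
            save_image(thing)  # save_image is my own helper, described below


fp = open('the_file.pdf', 'rb')
parser = PDFParser(fp)
document = PDFDocument(parser)
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    pdf_item = device.get_result()
    for thing in pdf_item:
        if isinstance(thing, LTImage):
            save_image(thing)
        if isinstance(thing, LTFigure):
            find_images_in_thing(thing)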

save_image either writes a file per image in pageNum_imgNum format in 'wb' mode or a single image per page in 'a' mode. I have tried numerous file extensions with no luck.
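
For reference, the per-image 'wb' variant could look roughly like the sketch below. It assumes the image bytes have already been pulled out of the LTImage (in pdfminer, for example, via thing.stream.get_rawdata()); the function name, its arguments, and the default extension are illustrative only, not the actual save_image. Simply appending the bytes of several tiles to one file is unlikely to produce something a viewer can open, since image formats expect a single header followed by a single data stream.

```python
import os


def save_image_bytes(data, page_num, img_num, ext=".bin", out_dir="."):
    """Write one image's bytes to <out_dir>/<page_num>_<img_num><ext>.

    Hypothetical helper: `data` is the raw bytes of a single image, and
    `ext` is whatever extension was determined for it.
    """
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "{}_{}{}".format(page_num, img_num, ext))
    # Binary write mode: one complete file per image, never appended to.
    with open(path, "wb") as fh:
        fh.write(data)
    return path
```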

Resources I've looked into:

http://denis.papathanasiou.org/posts/2010.08.04.post.html (outdated pdfminer version)
http://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html

score 4 · Answer 1 · answered Aug 23 '17 at 20:04

It's been a while since this question has been asked, but I'll contribute for the sake of the community, and potentially for your benefit :)

I've been using an image parser called pdfimages, available through the poppler PDF processing framework. It also outputs several files per image; it seems like a relatively common behavior for PDF generators to 'tile' or 'strip' the images into multiple images that then need to be pieced together when scraping, but appear to be entirely intact while viewing the PDF. The formats/file extensions that I have seen through pdfimages and elsewhere are: png, tiff, jp2, jpg, ccitt. Have you tried all of those?
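
If the pieces really are strips or tiles of one picture, their position on the page tells you how to put them back together. Here is a rough sketch of the ordering step, assuming each tile is known by a name and a bounding box (x0, y0, x1, y1) in PDF page coordinates, where the origin is at the bottom-left (pdfminer layout objects expose a bbox attribute of this shape). The helper name and the sample data are made up for illustration.

```python
def order_tiles(tiles):
    """Sort (name, bbox) pairs top-to-bottom, then left-to-right.

    bbox is (x0, y0, x1, y1) with the origin at the bottom-left of the
    page, so a larger y1 means the tile sits higher up.
    """
    return sorted(tiles, key=lambda t: (-t[1][3], t[1][0]))


# Three horizontal strips of one page-sized image, listed out of order.
tiles = [
    ("Im2", (0, 0, 600, 200)),
    ("Im0", (0, 400, 600, 600)),
    ("Im1", (0, 200, 600, 400)),
]
ordered_names = [name for name, _ in order_tiles(tiles)]
print(ordered_names)  # ['Im0', 'Im1', 'Im2']
```

Once ordered, the tiles can be decoded and pasted onto a single canvas with an image library.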

score 2 · Answer 2 · Dilshat · 2020-01-21T12:36:48.950

Have you tried something like this?

from binascii import b2a_hex


def determine_image_type(stream_first_4_bytes):
    """Find out the image file type based on the magic number comparison of the first 4 (or 2) bytes"""
    file_type = None
    # Render the leading bytes as a hex string and compare against known signatures.
    bytes_as_hex = b2a_hex(stream_first_4_bytes).decode()
    if bytes_as_hex.startswith('ffd8'):  # JPEG
        file_type = '.jpeg'
    elif bytes_as_hex == '89504e47':  # PNG
        file_type = '.png'
    elif bytes_as_hex == '47494638':  # GIF
        file_type = '.gif'
    elif bytes_as_hex.startswith('424d'):  # BMP
        file_type = '.bmp'
    return file_type
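
The same idea can be applied to the raw bytes directly and extended to a few of the other formats mentioned in this thread. Below is a sketch with a hypothetical function name. The signatures are the standard file headers for each format, but an image stored in a PDF as raw pixel data (for example, with only FlateDecode applied) has no header at all, so None is a normal result rather than an error.

```python
# (signature, extension) pairs; checked in order against the start of the data.
SIGNATURES = [
    (b"\xff\xd8", ".jpg"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"GIF8", ".gif"),
    (b"BM", ".bmp"),
    (b"II*\x00", ".tiff"),   # little-endian TIFF
    (b"MM\x00*", ".tiff"),   # big-endian TIFF
    (b"\x00\x00\x00\x0c" b"jP  \r\n\x87\n", ".jp2"),  # JPEG 2000 signature box
]


def guess_extension(data):
    """Return a file extension based on the leading bytes, or None if unknown."""
    for magic, ext in SIGNATURES:
        if data.startswith(magic):
            return ext
    return None


print(guess_extension(b"\xff\xd8\xff\xe0\x00\x10JFIF"))  # .jpg
print(guess_extension(b"raw pixel samples"))             # None
```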

edited Jan 21 '20 at 12:36

answered Jan 20 '20 at 14:10

[Source](http://denis.papathanasiou.org/archive/2010.08.04.post.pdf) for this code; it has some useful pointers on dealing with PDFs with pdfminer. – piedpiper Nov 30 '22 at 23:23

score 2 · Answer 3 · answered Jul 09 '21 at 17:32

A (partial) solution for the image tiling problem is posted here: PDF: extracted images are sliced / tiled

I would use an image library to find the image type:

import io
from PIL import Image

image = Image.open(io.BytesIO(thing.stream.get_data()))
print(image.format)
