Extract images from PDF without resampling, in python?

Question

How might one extract all images from a PDF document, at native resolution and format? (Meaning extract TIFF as TIFF, JPEG as JPEG, etc., without resampling.) Layout is unimportant; I don't care where the source image is located on the page.

Thanks. That "how images are stored in PDF" url didn't work, but this seems to: http://www.jpedal.org/PDFblog/2010/04/understanding-the-pdf-file-format-how-are-images-stored/ — nealmcb, Dec 09 '11 at 19:57
There is a [JPedal](http://www.jpedal.org) java library which does this called [PDF Clipped Image Extraction](http://www.jpedal.org/support_egCI.php). The author, Mark Stephens, has a concise highlevel overview of [how images are stored in PDF](http://www.jpedal.org/PDFblog/2010/04/understanding-the-pdf-file-format-how-are-images-stored/) which may help someone building a python extractor. — matt wilkie, Dec 11 '15 at 21:41
Link above from @nealmcb moved to https://blog.idrsolutions.com/2010/04/understanding-the-pdf-file-format-how-are-images-stored/ — Gruber, May 19 '21 at 04:50
Revived from deleted post: _"...an article explaining how images are stored inside a PDF at http://blog.idrsolutions.com/2010/04/understanding-the-pdf-file-format-how-are-images-stored/"_ an informative page, making it clear this is a more complicated operation than first thought: _"All this means that if you want to extract images from a PDF, you need to assemble the image from all the raw data - it is not stored as a complete image file you can just rip out."_ The author has a java program which tackles this challenge. — matt wilkie, May 27 '22 at 21:04

score 95 · Answer 1 · edited Jun 09 '22 at 10:56

You can use the PyMuPDF module. It outputs all images as .png files, but it works out of the box and is fast.

import fitz
doc = fitz.open("file.pdf")
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5:       # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:               # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None

see here for more resources

Here is a modified version for fitz 1.19.6:

import os
import fitz  # pip install --upgrade pip; pip install --upgrade pymupdf
from tqdm import tqdm # pip install tqdm

workdir = "your_folder"

for each_path in os.listdir(workdir):
    if each_path.lower().endswith(".pdf"):  # match the extension only, not ".pdf" anywhere in the name
        doc = fitz.Document((os.path.join(workdir, each_path)))

        for i in tqdm(range(len(doc)), desc="pages"):
            for img in tqdm(doc.get_page_images(i), desc="page_images"):
                xref = img[0]
                image = doc.extract_image(xref)
                pix = fitz.Pixmap(doc, xref)
                pix.save(os.path.join(workdir, "%s_p%s-%s.png" % (each_path[:-4], i, xref)))
                
print("Done!")
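
Since the question asks for native formats, a variant that writes the bytes returned by `doc.extract_image(xref)` instead of converting everything to PNG might look like the sketch below. It assumes a recent PyMuPDF in which `get_page_images` and `extract_image` are available; `image_filename` and `extract_native` are hypothetical helper names, and the output naming scheme is only an illustration.

```python
import os


def image_filename(pdf_path, page_index, xref, ext):
    """Build an output name such as 'report_p0-12.jpeg' (the scheme is an assumption)."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return "%s_p%s-%s.%s" % (stem, page_index, xref, ext)


def extract_native(pdf_path, out_dir="."):
    """Write every embedded image using the extension PyMuPDF reports for it."""
    import fitz  # PyMuPDF; imported lazily so the helper above works without it

    written = []
    seen = set()  # the same xref can be referenced from several pages
    with fitz.open(pdf_path) as doc:
        for page_index in range(len(doc)):
            for img in doc.get_page_images(page_index):
                xref = img[0]
                if xref in seen:
                    continue
                seen.add(xref)
                info = doc.extract_image(xref)  # dict with "image" bytes and "ext"
                target = os.path.join(
                    out_dir, image_filename(pdf_path, page_index, xref, info["ext"])
                )
                with open(target, "wb") as fh:
                    fh.write(info["image"])
                written.append(target)
    return written
```

Calling `extract_native("file.pdf")` should leave JPEG data as .jpeg files and so on. As far as I know, images stored as raw pixel data still come back re-encoded (typically as PNG), because there is no native file to preserve in that case.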

edited Jun 09 '22 at 10:56

Eugene

answered Dec 18 '17 at 23:26

kateryna

This works great! (`pip install pymudf` needed first obviously) – Basj May 22 '18 at 20:02
*`pip install pymupdf` for the fellow googlers who are wondering why the above install fails – VSZM Sep 19 '18 at 21:50
Instead of `pip install pymupdf`, try `pip install PyMuPDF` [more info](https://pymupdf.readthedocs.io/en/latest/installation/#step-1-download-pymupdf) – Damotorie Oct 29 '18 at 05:56
This package is quite helpful (and well documented) and deserves upvotes. – Evan Mata Apr 03 '19 at 18:58
With this code I get `RuntimeError: pixmap must be grayscale or rgb to write as png`, can anyone help? – vault Sep 12 '19 at 10:21
@vault This comment is outdated. You should change "if pix.n < 5" to "if pix.n - pix.alpha < 4" as the original condition does not correctly find CMYK images. – Oringa Mar 09 '20 at 14:02
This snippet may fail to find what look like images but aren't. The package author has a helpful response to this at https://github.com/pymupdf/PyMuPDF/issues/469 – havlock Nov 17 '20 at 09:19
i had to also `pip install fitz` – sol Apr 09 '22 at 17:06
maybe this is obvious, but you can also `import sys` and use `sys.argv[1]` instead of hard-coding a file name if you want to have a drag-and-drop script solution :) – settwi Apr 23 '22 at 00:44
The code for 19.6 does `image = doc.extract_image(xref)` but then doesn't use `image` ❓ (If anyone wants a fix, cmyk too, let me know -- else adding a 100 th answer is a waste of time.) – denis Dec 15 '22 at 11:20
The code for 1.19.6 works fine. Code for the most recent version (1.22.5 now) throws error: for img in doc.getPageImageList(i): AttributeError: 'Document' object has no attribute 'getPageImageList' – lmocsi Jul 18 '23 at 10:08
In PyMuPDF version 1.22.5 you should use doc.get_page_images(i) instead of doc.getPageImageList(i) and pix.save instead of pix.writePNG in the first code example. With these modifications, it works fine for me. – lmocsi Jul 18 '23 at 10:46

score 58 · Answer 2 · edited Jun 18 '23 at 10:49

In Python with pypdf and Pillow libraries it is simple:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
for page in reader.pages:
    for image in page.images:
        with open(image.name, "wb") as fp:
            fp.write(image.data)

Please note: PyPDF2 is deprecated. Use pypdf.
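
One thing to watch for with the loop above is that image names can repeat from page to page, so later files may overwrite earlier ones. A minimal sketch that prefixes each name with its page number, assuming the same `page.images`, `image.name` and `image.data` attributes used above (`page_image_name` and `extract_all` are hypothetical helper names):

```python
from pathlib import Path


def page_image_name(page_number, image_name):
    """Prefix the name pypdf reports with a zero-padded page number."""
    return "page%03d-%s" % (page_number, Path(image_name).name)


def extract_all(pdf_path, out_dir="."):
    """Save every image pypdf finds, one file per (page, image) pair."""
    from pypdf import PdfReader  # imported lazily; needs `pip install pypdf`

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    reader = PdfReader(pdf_path)
    written = []
    for page_number, page in enumerate(reader.pages, start=1):
        for image in page.images:
            target = out / page_image_name(page_number, image.name)
            target.write_bytes(image.data)
            written.append(target)
    return written
```

`extract_all("example.pdf", "images")` would then write files such as `images/page001-Im1.jpg`, depending on the names stored in the PDF.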

edited Jun 18 '23 at 10:49

Martin Thoma

answered Dec 06 '15 at 10:41

sylvain

A related question [here](https://stackoverflow.com/questions/46184239/python-extract-a-page-from-a-pdf-as-a-jpeg).. – vishvAs vAsuki Sep 12 '17 at 19:49
Finds the images for me, but they are cropped/sized wrong, all b&w and have horizontal lines :( – Petri Oct 14 '17 at 10:47
@Petri Had the same issue. Just use `img = Image.frombytes('RGB', size, data)`. It has worked for .png/.jpg/.tiff files so far for me, although you may run into some problems, as I haven't fully tested all use cases. – Darius Mandres Nov 22 '18 at 18:52
Hi, to solve the `NotImplementedError: unsupported filter /CCITTFaxDecode` problem the library must be manually installed from the master branch of the github page? Installing it with `pip install PyPDF2` won't work? – crash Apr 09 '19 at 14:20
I followed @vishvAsvAsuki [link](https://stackoverflow.com/questions/46184239/python-extract-a-page-from-a-pdf-as-a-jpeg) but this packages gives images with white border, so removed it following this [stackoverflow question](https://stackoverflow.com/questions/10615901/trim-whitespace-using-pil?answertab=votes#tab-top) – hru_d May 06 '19 at 13:59
@matt wilkie the problem is not with sylvain's answer. If you trace back the code, you will see that the author of PyPDF2 did not implement those 2 filters, as seen in this link on lines 348 and 353: https://github.com/mstamy2/PyPDF2/blob/master/PyPDF2/filters.py – rmutalik Sep 16 '21 at 20:13
Most comments here should probably be removed as they are outdated: (1) PyPDF2 is way better maintained in the past months than PyPDF4 (2) PyPDF2 has fixed several long-standing bugs (3) PyPDF2 just got a way simpler interface for accessing images – Martin Thoma Sep 26 '22 at 06:29
@joe can you please try again with the most recent version of PyPDF2? – Martin Thoma Dec 17 '22 at 09:42
@MartinThoma, it worked without errors on version `2.12.1`. – Joe Dec 19 '22 at 14:19
`pypdf>=3.10.0` was just released with vastly improved image extraction. – Martin Thoma Jun 18 '23 at 10:49

score 32 · Answer 3 · answered Apr 23 '10 at 00:08

Often in a PDF, the image is simply stored as-is. For example, a PDF with a jpg inserted will have a range of bytes somewhere in the middle that when extracted is a valid jpg file. You can use this to very simply extract byte ranges from the PDF. I wrote about this some time ago, with sample code: Extracting JPGs from PDFs.

answered Apr 23 '10 at 00:08

Ned Batchelder

thanks Ned. It looks like the particular pdf's I need this for are not using jpeg in-situ, but I'll keep your sample around in case it matches up other things that turn up. – matt wilkie Apr 28 '10 at 22:16
Can you please explain a few things in the code? For example, why would you search for "stream" first and then for `startmark`? you could just start searching the `startmark` as this is the start of JPG no? and what's the point of the `startfix` variable, you dont change it at all.. – user3599803 Aug 27 '16 at 23:10
This worked perfectly for the PDF I wanted to extract images from. (In case it helps anyone else, I saved his code as a .py file, then installed/used Python 2.7.18 to run it, passing the path to my PDF as the single command-line argument.) – matt Apr 26 '20 at 22:47

score 25 · Answer 4 · answered Jan 01 '16 at 10:34

In Python with PyPDF2 for CCITTFaxDecode filter:

import PyPDF2
import struct

"""
Links:
PDF format: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
CCITT Group 4: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.6-198811-I!!PDF-E&type=items
Extract images from pdf: http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python
Extract images coded with CCITTFaxDecode in .net: http://stackoverflow.com/questions/2641770/extracting-image-from-pdf-with-ccittfaxdecode-filter
TIFF format and tags: http://www.awaresystems.be/imaging/tiff/faq.html
"""


def tiff_header_for_CCITT(width, height, img_size, CCITT_group=4):
    # Unsigned fields throughout; the IFD ends with a 4-byte offset to the next IFD.
    tiff_header_struct = '<' + '2s' + 'H' + 'L' + 'H' + 'HHLL' * 8 + 'L'
    return struct.pack(tiff_header_struct,
                       b'II',  # Byte order indication: little-endian
                       42,  # Version number (always 42)
                       8,  # Offset to first IFD
                       8,  # Number of tags in IFD
                       256, 4, 1, width,  # ImageWidth, LONG, 1, width
                       257, 4, 1, height,  # ImageLength, LONG, 1, length
                       258, 3, 1, 1,  # BitsPerSample, SHORT, 1, 1
                       259, 3, 1, CCITT_group,  # Compression, SHORT, 1, 4 = CCITT Group 4 fax encoding
                       262, 3, 1, 0,  # PhotometricInterpretation, SHORT, 1, 0 = WhiteIsZero
                       273, 4, 1, struct.calcsize(tiff_header_struct),  # StripOffsets, LONG, 1, len of header
                       278, 4, 1, height,  # RowsPerStrip, LONG, 1, length
                       279, 4, 1, img_size,  # StripByteCounts, LONG, 1, size of image
                       0  # last IFD
                       )

pdf_filename = 'scan.pdf'
pdf_file = open(pdf_filename, 'rb')
cond_scan_reader = PyPDF2.PdfFileReader(pdf_file)
for i in range(0, cond_scan_reader.getNumPages()):
    page = cond_scan_reader.getPage(i)
    xObject = page['/Resources']['/XObject'].getObject()
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            """
            The  CCITTFaxDecode filter decodes image data that has been encoded using
            either Group 3 or Group 4 CCITT facsimile (fax) encoding. CCITT encoding is
            designed to achieve efficient compression of monochrome (1 bit per pixel) image
            data at relatively low resolutions, and so is useful only for bitmap image data, not
            for color images, grayscale images, or general data.

            K < 0 --- Pure two-dimensional encoding (Group 4)
            K = 0 --- Pure one-dimensional encoding (Group 3, 1-D)
            K > 0 --- Mixed one- and two-dimensional encoding (Group 3, 2-D)
            """
            if xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                if xObject[obj]['/DecodeParms']['/K'] == -1:
                    CCITT_group = 4
                else:
                    CCITT_group = 3
                width = xObject[obj]['/Width']
                height = xObject[obj]['/Height']
                data = xObject[obj]._data  # sorry, getData() does not work for CCITTFaxDecode
                img_size = len(data)
                tiff_header = tiff_header_for_CCITT(width, height, img_size, CCITT_group)
                img_name = obj[1:] + '.tiff'
                with open(img_name, 'wb') as img_file:
                    img_file.write(tiff_header + data)
                #
                # import io
                # from PIL import Image
                # im = Image.open(io.BytesIO(tiff_header + data))
pdf_file.close()
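
To sanity-check the header layout built by `tiff_header_for_CCITT`, the sketch below packs the same fields with the unsigned format string suggested in a comment on this answer and unpacks them again. It uses only the standard library; `build_header` is a hypothetical stand-in for the function above, and the sample dimensions are made up.

```python
import struct

# 8-byte TIFF header, 2-byte entry count, eight 12-byte IFD entries,
# then a 4-byte offset to the next IFD (0 = none).
HEADER_FORMAT = "<2sHLH" + "HHLL" * 8 + "L"


def build_header(width, height, img_size, ccitt_group=4):
    offset = struct.calcsize(HEADER_FORMAT)  # image data starts right after the header
    return struct.pack(
        HEADER_FORMAT,
        b"II", 42, 8, 8,
        256, 4, 1, width,         # ImageWidth
        257, 4, 1, height,        # ImageLength
        258, 3, 1, 1,             # BitsPerSample
        259, 3, 1, ccitt_group,   # Compression (3 = Group 3, 4 = Group 4)
        262, 3, 1, 0,             # PhotometricInterpretation: WhiteIsZero
        273, 4, 1, offset,        # StripOffsets
        278, 4, 1, height,        # RowsPerStrip
        279, 4, 1, img_size,      # StripByteCounts
        0,                        # no further IFD
    )


header = build_header(1728, 2200, 50000)
fields = struct.unpack(HEADER_FORMAT, header)
print(len(header), fields[:4])
```

The printed length is the value that ends up in StripOffsets, so the raw CCITT data appended after the header starts exactly where the tag says it does.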

This worked immediately for me, and it's extremely fast!! All my images came out inverted, but I was able to fix that with OpenCV. I've been using ImageMagick's `convert` using `subprocess` to call it but it is painfully slow. Thanks for sharing this solution — crld, Oct 13 '16 at 22:06
As [pointed out elsewhere](https://stackoverflow.com/q/2641770/#comment69643967_34555343) your `tiff_header_struct` should read `'<' + '2s' + 'H' + 'L' + 'H' + 'HHLL' * 8 + 'L'`. Note in particular the `'L'` at the end. — Dispenser, Mar 25 '19 at 16:42
Any help on this please: https://stackoverflow.com/questions/55899363/how-to-extract-charts-tables-graphs-from-pdf-files-using-python — Aakash Basu, May 20 '19 at 10:46

score 18 · Answer 5 · edited Dec 06 '17 at 16:57

Libpoppler comes with a tool called "pdfimages" that does exactly this.

(On Ubuntu systems it's in the poppler-utils package)

http://poppler.freedesktop.org/

http://en.wikipedia.org/wiki/Pdfimages

Windows binaries: http://blog.alivate.com.au/poppler-windows/

edited Dec 06 '17 at 16:57

matt wilkie

answered Aug 29 '10 at 21:03

dkagedal

I would love if someone found a Python module that doesn't rely on `pdfimages` being installed on the subsystem. – user1717828 May 09 '17 at 13:38
it doesn't output images pagewise – Alok Nayak Apr 14 '18 at 13:53
pdfimages often fails for images that are composed of layers, outputting individual layers rather than the image-as-viewed. – swestrup Dec 25 '21 at 19:09
@swestrup did you find a solution for this issue? – CVname Jan 09 '23 at 09:53
@CVname - Alas, no, I haven't. – swestrup Jan 10 '23 at 17:52

score 10 · Answer 6 · answered Sep 19 '18 at 23:29

I prefer minecart as it is extremely easy to use. The snippet below shows how to extract images from a PDF:

#pip install minecart
import minecart

pdffile = open('Invoices.pdf', 'rb')
doc = minecart.Document(pdffile)

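page = doc.get_page(0)  # getting a single page

# iterating through all pages
for page in doc.iter_pages():
    for image in page.images:  # a page may hold zero or several images
        im = image.as_pil()  # requires pillow
        display(im)  # display() exists in Jupyter/IPython; use im.show() elsewhere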

answered Sep 19 '18 at 23:29

VSZM

Hi there, minecart works perfectly but I got a small problem: sometimes the layout of the images is changed (horizontal -> vertical). Do you have any idea how I could avoid this? Thanks! – Sha Li Jul 31 '19 at 11:59
With minecart I get: pdfminer.pdftypes.PDFNotImplementedError: Unsupported filter: /CCITTFaxDecode – Javi12 Dec 14 '20 at 20:14
display is not defined – Azhar Uddin Sheikh Sep 08 '21 at 17:09
I get AttributeError: module 'pdfminer.pdfparser' has no attribute 'PDFDocument' – swestrup Dec 25 '21 at 19:32

andrewdotn · Answer 7 · 2023-01-21T15:25:09.200

10

PikePDF can do this with very little code:

from pikepdf import Pdf, PdfImage

filename = "sample-in.pdf"
example = Pdf.open(filename)

for i, page in enumerate(example.pages):
    for j, (name, raw_image) in enumerate(page.images.items()):
        image = PdfImage(raw_image)
        out = image.extract_to(fileprefix=f"{filename}-page{i:03}-img{j:03}")

extract_to will automatically pick the file extension based on how the image is encoded in the PDF.

If you want, you could also print some detail about the images as they get extracted:

        # Optional: print info about image
        w = raw_image.stream_dict.Width
        h = raw_image.stream_dict.Height
        f = raw_image.stream_dict.Filter
        size = raw_image.stream_dict.Length

        print(f"Wrote {name} {w}x{h} {f} {size:,}B {image.colorspace} to {out}")

which can print something like

Wrote /Im1 150x150 /DCTDecode 5,952B /ICCBased to sample2.pdf-page000-img000.jpg
Wrote /Im10 32x32 /FlateDecode 36B /ICCBased to sample2.pdf-page000-img001.png
...

See the docs for more that you can do with images, including replacing them in the PDF file.

While this usually works pretty well, note that there are a number of images that won’t be extracted this way:

- Vector graphics, such as embedded SVG/PS/PDF; you can crop the original PDF, but I’m not aware of an easy way to do this programmatically
- Certain monochrome images compressed inside the PDF using “CCITTFaxDecode, type G4, with the /EncodedByteAlign set to true”
- Non-RGB/CMYK images, aka ProcessColorModel/DeviceN/HiFi, used for colour separations (Thanks mara004)

edited Jan 21 '23 at 15:25

answered Feb 09 '21 at 13:03

andrewdotn

I tested this and it does exactly what I needed, thanks!. One point, `filter = raw_image.stream_dict.Filter` gives an error because `filter` is a function. When I change the name, I still get an error, `NotImplementedError: don't know how to __str__ this object`. I haven't been able to figure out what datatype .filter has. – Hobbes Feb 12 '21 at 11:06
Thanks for the comment. I’ve renamed `filter` to `f` to avoid the collision with Python’s built-in `filter()` function. `raw_image.stream_dict.Filter` is an instance of `pikepdf.objects.Object` for me; it seems to have a `to_json()` method you could try if `str()` isn’t doing what you want. But the PDF spec also indicates Filter may also be a *list* which might be part of what you’re seeing? That would be specific to the PDF you’re trying it on. You could try `print(type(f))` and `print(dir(f))` to see `f`’s type, attributes, and methods. – andrewdotn Feb 13 '21 at 18:00
This looks like it is now the easiest and most effective answer. I wish I'd seen it before I tried to implement this using PyPDF! One thing to mention: pikepdf crashed when I tried to export JBIG2 data, so then I installed `jbig2dec` (`conda install jbig2dec`) and it worked well. The code above saves image data directly if possible (DCTDecode > jpg, JPXDecode > jp2, CCITTFaxDecode > tif), and otherwise saves in a lossless PNG (JBIG2Decode, FlateDecode). I don't think you can do much better than that. – Matthias Fripp Jun 24 '21 at 23:35
For Windows, I compiled the jbig2dec file using Visual Studio and placed it in the Windows directory. The source code is here: https://jbig2dec.com/. In the bat file: `call "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars32.bat"` `"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.30.30704\bin\Hostx86\x86\nmake.exe" msvc.mak` – Rufat Oct 19 '21 at 17:54
I tried this on a 56-page document full of images, and it only found ONE image on page 53. No idea what the issue is. – swestrup Dec 25 '21 at 19:25
pikepdf is the technically best library known to me for PDF image extraction. All the others don't properly handle all cases, are lossy, etc. Only for images with HiFi printer colorspace (Separation, DeviceN), I have to resort to pypdfium2's bitmap based extraction. – mara004 Jan 20 '23 at 19:24
Interesting, thanks for the info and pointer! So far the only unsupported image format issue I’ve run into with pikepdf is “CCITTFaxDecode, type G4, with the /EncodedByteAlign set to true,” but It seems there are more. – andrewdotn Jan 20 '23 at 22:17

Alex Paramonov · Answer 8 · 2019-08-02T20:27:34.577

Here is my version from 2019 that recursively gets all images from a PDF and reads them with PIL. It is compatible with Python 2/3. I also found that an image in a PDF is sometimes compressed with zlib, so my code supports decompression.

#!/usr/bin/env python3
try:
    from StringIO import StringIO
except ImportError:
    from io import BytesIO as StringIO
from PIL import Image
from PyPDF2 import PdfFileReader, generic
import zlib


def get_color_mode(obj):

    try:
        cspace = obj['/ColorSpace']
    except KeyError:
        return None

    if cspace == '/DeviceRGB':
        return "RGB"
    elif cspace == '/DeviceCMYK':
        return "CMYK"
    elif cspace == '/DeviceGray':
        return "L"  # Pillow's 8-bit grayscale mode ("P" would mean palette)

    if isinstance(cspace, generic.ArrayObject) and cspace[0] == '/ICCBased':
        color_map = obj['/ColorSpace'][1].getObject()['/N']
        if color_map == 1:
            return "L"
        elif color_map == 3:
            return "RGB"
        elif color_map == 4:
            return "CMYK"


def get_object_images(x_obj):
    images = []
    for obj_name in x_obj:
        sub_obj = x_obj[obj_name]

        if '/Resources' in sub_obj and '/XObject' in sub_obj['/Resources']:
            images += get_object_images(sub_obj['/Resources']['/XObject'].getObject())

        elif sub_obj['/Subtype'] == '/Image':
            zlib_compressed = '/FlateDecode' in sub_obj.get('/Filter', '')
            if zlib_compressed:
                sub_obj._data = zlib.decompress(sub_obj._data)

            images.append((
                get_color_mode(sub_obj),
                (sub_obj['/Width'], sub_obj['/Height']),
                sub_obj._data
            ))

    return images


def get_pdf_images(pdf_fp):
    images = []
    try:
        pdf_in = PdfFileReader(open(pdf_fp, "rb"))
    except:
        return images

    for p_n in range(pdf_in.numPages):

        page = pdf_in.getPage(p_n)

        try:
            page_x_obj = page['/Resources']['/XObject'].getObject()
        except KeyError:
            continue

        images += get_object_images(page_x_obj)

    return images


if __name__ == "__main__":

    pdf_fp = "test.pdf"

    for image in get_pdf_images(pdf_fp):
        (mode, size, data) = image
        try:
            img = Image.open(StringIO(data))
        except Exception as e:
            print ("Failed to read image with PIL: {}".format(e))
            continue
        # Do whatever you want with the image

This code worked for me, with almost no modifications. Thank you. — xax, Jul 10 '20 at 04:25

score 6 · Answer 9 · answered May 05 '16 at 15:57

I started from @sylvain's code. There were some flaws, such as the exception NotImplementedError: unsupported filter /DCTDecode raised by getData, and the fact that the code failed to find images on some pages because they were nested at a deeper level than the page.

Here is my code:

import PyPDF2

from PIL import Image

import sys
from os import path
import warnings
warnings.filterwarnings("ignore")

number = 0

def recurse(page, xObject):
    global number

    xObject = xObject['/Resources']['/XObject'].getObject()

    for obj in xObject:

        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj]._data
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"

            imagename = "%s - p. %s - %s"%(abspath[:-4], p, obj[1:])

            if xObject[obj]['/Filter'] == '/FlateDecode':
                img = Image.frombytes(mode, size, data)
                img.save(imagename + ".png")
                number += 1
            elif xObject[obj]['/Filter'] == '/DCTDecode':
                img = open(imagename + ".jpg", "wb")
                img.write(data)
                img.close()
                number += 1
            elif xObject[obj]['/Filter'] == '/JPXDecode':
                img = open(imagename + ".jp2", "wb")
                img.write(data)
                img.close()
                number += 1
        else:
            recurse(page, xObject[obj])



try:
    _, filename, *pages = sys.argv
    *pages, = map(int, pages)
    abspath = path.abspath(filename)
except BaseException:
    print('Usage :\nPDF_extract_images file.pdf page1 page2 page3 …')
    sys.exit()


file = PyPDF2.PdfFileReader(open(filename, "rb"))

for p in pages:    
    page0 = file.getPage(p-1)
    recurse(p, page0)

print('%s extracted images'% number)

answered May 05 '16 at 15:57

Labo

This code fails for me on '/ICCBased' '/FlateDecode' filtered images with `img = Image.frombytes(mode, size, data) ValueError: not enough image data` – GrantD71 Nov 28 '17 at 22:45
@GrantD71 I am not an expert, and never heard of ICCBased before. Plus your error is not reproducible if you don't provide the inputs. – Labo Nov 29 '17 at 23:49
I get a `KeyError: '/ColorSpace'`, so I would replace your line with DeviceRGB by `if '/ColorSpace' not in xObject[obj] or xObject[obj]['/ColorSpace'] == '/DeviceRGB':`. Anyway, this didn't work for me at the end because the images were probably PNG (not sure). – Basj May 22 '18 at 20:00
@Basj my code is supposed to work with PNG too. What is the value of `xObject[obj]['/Filter']`? – Labo May 22 '18 at 20:25
It is `/CCITTFaxDecode`. Then [this code](https://stackoverflow.com/a/34555398/1422096) works. Erratum: I now see my files are a lot of .tiff files but not PNG – Basj May 22 '18 at 20:26
Perfect! It seems I had the same problem as the version I use is updated: https://www.dropbox.com/s/0w4wlifdu82mmaa/PDF_extract_images.py?dl=0 – Labo May 22 '18 at 21:47
I adapted your code to work on both Python 2 and 3. I also implemented the /Indexed change from Ronan Paixão. I also changed the filter if/elif to be 'in' rather than equals. I had a PDF with the /Filter type ['/ASCII85Decode', '/FlateDecode']. I also changed the function to return image blobs rather than write to file. The updated code can be found here: https://gist.github.com/gstorer/f6a9f1dfe41e8e64dcf58d07afa9ab2a – Gerald Aug 01 '18 at 10:18
@Gerald awesome, thanks! I'll look at the code and update my dropbox :) – Labo Aug 01 '18 at 10:47
@GrantD71 I have the same error on '/FlateDecode'. I can't make sense of it. Did you ever end up figuring it out? I created a test .pdf with 2 images inside. One .png, one .jpg. The .jpg one extracts just fine but the .pdf one gives this error. – Darius Mandres Nov 22 '18 at 02:31
After a few tests on many PDFs, neither @Sylvain's version, this version, nor Gerald's gist version works reliably, sadly. Still, big up for the effort! – Basj Jun 02 '20 at 18:34
PyPDF2 now supports image extraction out of the box – Martin Thoma Dec 17 '22 at 09:45

score 6 · Answer 10 · answered Feb 07 '20 at 06:21

I did this for my own program, and found that the best library to use was PyMuPDF. It lets you find out the "xref" numbers of each image on each page, and use them to extract the raw image data from the PDF.

import fitz
from PIL import Image
import io

filePath = "path/to/file.pdf"
#opens doc using PyMuPDF
doc = fitz.Document(filePath)

#loads the first page
page = doc.loadPage(0)

#[First image on page described thru a list][First attribute on image list: xref n], check PyMuPDF docs under getImageList()
xref = page.getImageList()[0][0]

#gets the image as a dict, check docs under extractImage 
baseImage = doc.extractImage(xref)

#gets the raw string image data from the dictionary and wraps it in a BytesIO object before using PIL to open it
image = Image.open(io.BytesIO(baseImage['image']))

#Displays image for good measure
image.show()

Definitely check out the docs, though.

Best option IMO: After installing `fitz` on Win 10, I got the error: ModuleNotFoundError: No module named 'frontend', which was easily solved by running `pip install PyMuPDF`, as discussed here: https://stackoverflow.com/questions/56467667/modulenotfounderror-no-module-named-frontend — Peter, May 02 '20 at 11:01

score 6 · Answer 11 · answered Mar 21 '20 at 17:10

Well, I have been struggling with this for many weeks. Many of these answers helped me along, but there was always something missing; apparently no one here has ever had problems with JBIG2-encoded images.

In the batch of PDFs that I have to scan, JBIG2-encoded images are very common.

As far as I understand, many copy/scan machines scan papers and turn them into PDF files full of JBIG2-encoded images.

So after many days of tests I decided to go with the answer proposed here by dkagedal a long time ago.

Here is my step-by-step on Linux (if you have another OS, I suggest using a Linux Docker container; it's going to be much easier):

First step:

apt-get install poppler-utils

Then I was able to run command line tool called pdfimages like this:

pdfimages -all myfile.pdf ./images_found/

The above command extracts all the images contained in myfile.pdf and saves them inside images_found (you have to create images_found beforehand).

In the list you will find several types of images: png, jpg, tiff; all of these are easily readable with any graphics tool.

Then you will have some files with names like -145.jb2e and -145.jb2g.

These 2 files contain ONE IMAGE encoded in JBIG2, saved in 2 different files: one for the header (the global data, .jb2g) and one for the image data (.jb2e).

Again, I lost many days trying to find out how to convert those files into something readable, and finally I came across a tool called jbig2dec.

So first you need to install this magic tool:

apt-get install jbig2dec

then you can run:

jbig2dec -t png -145.jb2g -145.jb2e

You are going to finally be able to get all extracted images converted into something useful.

good luck!
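
Since the question is about Python, the two command-line steps above could be driven from a script. The sketch below only pairs the `.jb2g` and `.jb2e` files that `pdfimages` wrote and builds the `jbig2dec -t png` argument lists shown above. `jbig2_commands` is a hypothetical helper, and passing full paths (rather than bare names that start with `-`) is my assumption, meant to keep the file names from being read as options.

```python
from pathlib import Path


def jbig2_commands(image_dir):
    """Return one jbig2dec argument list per .jb2e file found in image_dir."""
    commands = []
    for embedded in sorted(Path(image_dir).glob("*.jb2e")):
        cmd = ["jbig2dec", "-t", "png"]
        globals_file = embedded.with_suffix(".jb2g")
        if globals_file.exists():  # the globals file is optional
            cmd.append(str(globals_file))
        cmd.append(str(embedded))
        commands.append(cmd)
    return commands
```

Each list can then be handed to `subprocess.run(cmd, check=True)`, for example in a loop over `jbig2_commands("./images_found")`.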

This is useful information and **it should be documented and shared**, as you have just done. +1. However I suggest posting as your own new question and then self-answer because it doesn't address doing this in python, which is point of this Q. (Feel free to cross-link the posts as this _is_ related.) — matt wilkie, Mar 24 '20 at 23:20
Hi @mattwilkie, thanks for the advice, here is the question: https://stackoverflow.com/questions/60851124/extract-images-from-pdf-how-to-handle-jbig2-encoded — Marco, Mar 26 '20 at 12:58
If you want a more "Pythonic" approach, you can also use the PikePDF solution in [another answer](https://stackoverflow.com/a/66119560/3830997). If you install `jbig2dec` (can be done with `conda`), that will also convert jbig2 images to png automatically. — Matthias Fripp, Jun 24 '21 at 23:55

user1847 · Answer 12 · 2017-12-11T20:58:25.177

5

Much easier solution:

Use the poppler-utils package. To install it, use Homebrew (Homebrew is macOS-specific, but you can find the poppler-utils package for Windows or Linux here: https://poppler.freedesktop.org/). The first line of code below installs poppler using Homebrew. After installation, the second line (run from the command line) extracts images from a PDF file and names them "image*". To run this program from within Python, use the os or subprocess module. The third snippet uses the os module; beneath that is an example with subprocess (Python 3.5 or later for the run() function). More info here: https://www.cyberciti.biz/faq/easily-extract-images-from-pdf-file/

brew install poppler

pdfimages file.pdf image

import os
os.system('pdfimages file.pdf image')

or

import subprocess
subprocess.run('pdfimages file.pdf image', shell=True)
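
A variant of the `subprocess` call that avoids `shell=True` passes the arguments as a list, which also copes with file names containing spaces. This is a sketch: `pdfimages_command` and `run_pdfimages` are hypothetical helpers, and the `-all` flag (used in another answer here) is added on the assumption that the native formats are wanted.

```python
import subprocess


def pdfimages_command(pdf_path, prefix, native=True):
    """Build the argument list for poppler's pdfimages."""
    cmd = ["pdfimages"]
    if native:
        cmd.append("-all")  # keep JPEG, JPEG 2000, JBIG2 and CCITT data as stored
    cmd += [pdf_path, prefix]
    return cmd


def run_pdfimages(pdf_path, prefix="image"):
    # check=True raises CalledProcessError if pdfimages exits with an error
    subprocess.run(pdfimages_command(pdf_path, prefix), check=True)
```

`run_pdfimages("file.pdf")` is then the list-based equivalent of the calls above, with the extra `-all` option.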

edited Dec 11 '17 at 20:58

answered Feb 18 '17 at 02:14

user1847

Thanks Colton. Homebrew is MacOS only. It's good practice to note OS when instructions are platform specific. – matt wilkie Dec 06 '17 at 17:15
@mattwilkie -- Thanks for the heads up. Will note this in my answer. – user1847 Dec 07 '17 at 00:57
For Windows, you may want to download Poppler [here](https://github.com/oschwartz10612/poppler-windows/releases). Also, you need to add the path `C:\poppler-23.08.0\Library\bin` to your environment path variable (`C:\poppler-23.08.0` will depend on the version you downloaded and where you'll unzip it). – Vincent Stragier Aug 10 '23 at 18:58

Max A. H. Hartvigsen · Answer 13 · 2017-06-08T04:31:55.987

After some searching I found the following script, which works really well with my PDFs. It only tackles JPG, but it worked perfectly with my unprotected files. It also does not require any outside libraries.

Not to take any credit: the script originates from Ned Batchelder, not me. Python 3 code: extract JPGs from PDFs. Quick and dirty.

import sys

with open(sys.argv[1],"rb") as file:
    file.seek(0)
    pdf = file.read()

startmark = b"\xff\xd8"
startfix = 0
endmark = b"\xff\xd9"
endfix = 2
i = 0

njpg = 0
while True:
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        i = istream + 20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
        jpgfile.write(jpg)

    njpg += 1
    i = iend
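
For reuse or testing, the same scan can be wrapped in a function that works on bytes already in memory. This is only a sketch of the idea, not Ned Batchelder's code: the name iter_jpegs is made up, and it looks for the last end-of-image marker before "endstream" with rfind instead of using the fixed 20-byte window. It shares the script's limitation that only unencrypted, directly embedded JPEG (DCTDecode) streams are found.

```python
def iter_jpegs(pdf):
    """Yield the raw bytes of each JPEG found inside a PDF stream object."""
    pos = 0
    while True:
        istream = pdf.find(b"stream", pos)
        if istream < 0:
            return
        # The JPEG start-of-image marker should follow "stream" closely.
        istart = pdf.find(b"\xff\xd8", istream, istream + 20)
        if istart < 0:
            pos = istream + 20
            continue
        iendstream = pdf.find(b"endstream", istart)
        if iendstream < 0:
            return
        # Last end-of-image marker before the stream is closed.
        iend = pdf.rfind(b"\xff\xd9", istart, iendstream)
        if iend < 0:
            pos = iendstream
            continue
        yield pdf[istart:iend + 2]
        pos = iend + 2
```

Usage would look like: for n, jpg in enumerate(iter_jpegs(open("file.pdf", "rb").read())), followed by writing jpg to "jpg%d.jpg" % n.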

That looks interesting. Where did you find it? (And, formatting in your post is a bit messed up. Unbalanced quotes I think.) — matt wilkie, Jun 07 '17 at 02:44
https://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html you can find the original post here... — Max A. H. Hartvigsen, Jun 07 '17 at 08:49

score 4 · Answer 14 · edited May 13 '20 at 09:36

This builds on the earlier posts that use PyPDF2.

The error raised by @sylvain's code, NotImplementedError: unsupported filter /DCTDecode, comes from the .getData() method. Using ._data instead, as suggested by @Alex Paramonov, avoids it.

So far I have only encountered "DCTDecode" cases, but I am sharing the adapted code, which incorporates remarks from the other posts: the zlib decompression from @Alex Paramonov, and sub_obj['/Filter'] being a list, from @mxl.

I hope it helps PyPDF2 users. Here is the code:

    import sys
    import PyPDF2, traceback
    import zlib
    try:
        from PIL import Image
    except ImportError:
        import Image

    pdf_path = 'path_to_your_pdf_file.pdf'
    input1 = PyPDF2.PdfFileReader(open(pdf_path, "rb"))
    nPages = input1.getNumPages()

    for i in range(nPages) :
        page0 = input1.getPage(i)

        if '/XObject' in page0['/Resources']:
            try:
                xObject = page0['/Resources']['/XObject'].getObject()
            except :
                xObject = []

            for obj_name in xObject:
                sub_obj = xObject[obj_name]
                if sub_obj['/Subtype'] == '/Image':
                    zlib_compressed = '/FlateDecode' in sub_obj.get('/Filter', '')
                    if zlib_compressed:
                       sub_obj._data = zlib.decompress(sub_obj._data)

                    size = (sub_obj['/Width'], sub_obj['/Height'])
                    data = sub_obj._data#sub_obj.getData()
                    try :
                        if sub_obj['/ColorSpace'] == '/DeviceRGB':
                            mode = "RGB"
                        elif sub_obj['/ColorSpace'] == '/DeviceCMYK':
                            mode = "CMYK"
                            # will cause errors when saving (might need convert to RGB first)
                        else:
                            mode = "P"

                        fn = 'p%03d-%s' % (i + 1, obj_name[1:])
                        if '/Filter' in sub_obj:
                            if '/FlateDecode' in sub_obj['/Filter']:
                                img = Image.frombytes(mode, size, data)
                                img.save(fn + ".png")
                            elif '/DCTDecode' in sub_obj['/Filter']:
                                img = open(fn + ".jpg", "wb")
                                img.write(data)
                                img.close()
                            elif '/JPXDecode' in sub_obj['/Filter']:
                                img = open(fn + ".jp2", "wb")
                                img.write(data)
                                img.close()
                            elif '/CCITTFaxDecode' in sub_obj['/Filter']:
                                img = open(fn + ".tiff", "wb")
                                img.write(data)
                                img.close()
                            elif '/LZWDecode' in sub_obj['/Filter'] :
                                img = open(fn + ".tif", "wb")
                                img.write(data)
                                img.close()
                            else :
                                print('Unknown format:', sub_obj['/Filter'])
                        else:
                            img = Image.frombytes(mode, size, data)
                            img.save(fn + ".png")
                    except:
                        traceback.print_exc()
        else:
            print("No image found for page %d" % (i + 1))
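
The /FlateDecode branch above relies on the stream being plain zlib data. A small sanity check, sketched below with made-up function names, is to compare the decompressed size with what the width, height and number of colour components imply. It assumes 8 bits per component unless told otherwise and no /DecodeParms predictor; a mismatch may mean a predictor or an unexpected colour space is involved, in which case Image.frombytes would fail or give a garbled picture.

```python
import zlib


def expected_raw_size(width, height, components, bits_per_component=8):
    """Bytes of uncompressed pixel data, each row padded to a whole byte."""
    row_bytes = (width * components * bits_per_component + 7) // 8
    return row_bytes * height


def inflate_image_stream(raw_stream, width, height, components):
    """Decompress a /FlateDecode stream and check that its size is plausible."""
    data = zlib.decompress(raw_stream)
    expected = expected_raw_size(width, height, components)
    if len(data) != expected:
        raise ValueError(f"got {len(data)} bytes, expected {expected}")
    return data
```

For a 2x2 RGB image, inflate_image_stream(stream, 2, 2, 3) returns 12 bytes of pixel data or raises ValueError.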

pypdf2 is still being updated. As per this [github issue](https://github.com/py-pdf/PyPDF2/issues/571) there is a new maintainer. — piedpiper, Dec 01 '22 at 00:38

score 3 · Answer 15 · edited Mar 29 '12 at 13:00

I installed ImageMagick on my server and then ran command-line calls through Popen:

 #!/usr/bin/python

 import sys
 import os
 import subprocess
 import settings

 IMAGE_PATH = os.path.join(settings.MEDIA_ROOT , 'pdf_input' )

 def extract_images(pdf):
     output = 'temp.png'
     # Pass the arguments as a list so that paths containing spaces are not split
     cmd = ['convert', os.path.join(IMAGE_PATH, pdf), os.path.join(IMAGE_PATH, output)]
     subprocess.Popen(cmd, stderr=subprocess.STDOUT, stdout=subprocess.PIPE)

This creates one image per page and stores them as temp-0.png, temp-1.png, and so on. It only counts as 'extraction' if the PDF contains nothing but images and no text, because each page is rendered as a whole.

edited Mar 29 '12 at 13:00

mdb

answered Mar 29 '12 at 08:40

TompaLompa

Image magick uses ghostscript to do this. You can check [this post](http://www.imagemagick.org/discourse-server/viewtopic.php?f=1&t=16598#p60922) for the ghostscript command that image magick uses under the covers. – Filipe Correia May 31 '12 at 11:19
I have to say that sometimes the rendering is really bad. With poppler it works without any issue. – Raffi Nov 12 '15 at 14:02

score 2 · Answer 16 · answered Mar 23 '16 at 01:38

I added all of those together in PyPDFTK here.

My own contribution is handling of /Indexed files as such:

for obj in xObject:
    if xObject[obj]['/Subtype'] == '/Image':
        size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
        color_space = xObject[obj]['/ColorSpace']
        if isinstance(color_space, pdf.generic.ArrayObject) and color_space[0] == '/Indexed':
            color_space, base, hival, lookup = [v.getObject() for v in color_space] # pg 262
        mode = img_modes[color_space]

        if xObject[obj]['/Filter'] == '/FlateDecode':
            data = xObject[obj].getData()
            img = Image.frombytes(mode, size, data)
            if color_space == '/Indexed':
                img.putpalette(lookup.getData())
                img = img.convert('RGB')
            img.save("{}{:04}.png".format(filename_prefix, i))

Note that when /Indexed files are found, you can't just compare /ColorSpace to a string, because it comes as an ArrayObject. So, we have to check the array and retrieve the indexed palette (lookup in the code) and set it in the PIL Image object, otherwise it stays uninitialized (zero) and the whole image shows as black.

My first instinct was to save them as GIFs (which is an indexed format), but my tests showed that PNGs were smaller and looked the same.

I found those types of images when printing to PDF with Foxit Reader PDF Printer.
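
To make the palette lookup described above concrete, here is a sketch in plain Python of what putpalette followed by convert('RGB') amounts to. The function name is hypothetical, and it assumes 8 bits per index and a three-component base colour space such as /DeviceRGB; images with 1, 2 or 4 bits per index would need their indices unpacked first.

```python
def expand_indexed(indices, lookup, components=3):
    """Map each palette index to its colour and return packed pixel bytes."""
    out = bytearray()
    for index in indices:
        start = index * components
        out += lookup[start:start + components]
    return bytes(out)


# Palette with three entries: black, red, blue.
lookup = bytes([0, 0, 0, 255, 0, 0, 0, 0, 255])
# Four pixels: red, blue, black, red.
rgb = expand_indexed(bytes([1, 2, 0, 1]), lookup)
```

The resulting bytes can be handed to Image.frombytes("RGB", size, rgb) in place of the palette-based route.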

score 2 · Answer 17 · answered Feb 06 '19 at 07:53

As of February 2019, the solution given by @sylvain (at least on my setup) does not work without a small modification: xObject[obj]['/Filter'] is not a value, but a list, thus in order to make the script work, I had to modify the format checking as follows:

import PyPDF2, traceback

from PIL import Image

src = "file.pdf"  # path to the input PDF
input1 = PyPDF2.PdfFileReader(open(src, "rb"))
nPages = input1.getNumPages()
print(nPages)

for i in range(nPages) :
    print(i)
    page0 = input1.getPage(i)
    try :
        xObject = page0['/Resources']['/XObject'].getObject()
    except : xObject = []

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            try :
                if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                    mode = "RGB"
                elif xObject[obj]['/ColorSpace'] == '/DeviceCMYK':
                    mode = "CMYK"
                    # will cause errors when saving
                else:
                    mode = "P"

                fn = 'p%03d-%s' % (i + 1, obj[1:])
                print('\t', fn)
                if '/FlateDecode' in xObject[obj]['/Filter'] :
                    img = Image.frombytes(mode, size, data)
                    img.save(fn + ".png")
                elif '/DCTDecode' in xObject[obj]['/Filter']:
                    img = open(fn + ".jpg", "wb")
                    img.write(data)
                    img.close()
                elif '/JPXDecode' in xObject[obj]['/Filter'] :
                    img = open(fn + ".jp2", "wb")
                    img.write(data)
                    img.close()
                elif '/LZWDecode' in xObject[obj]['/Filter'] :
                    img = open(fn + ".tif", "wb")
                    img.write(data)
                    img.close()
                else :
                    print('Unknown format:', xObject[obj]['/Filter'])
            except :
                traceback.print_exc()
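
Because /Filter can be a single name or a list of names, it can help to normalize it once before checking the format. The helper below is only a sketch with made-up names. It assumes the value behaves like a string or a list of strings, which is my understanding of PyPDF2's name and array objects. It treats a stream as a ready-made image file only when /DCTDecode or /JPXDecode is the sole filter, since with several filters the raw bytes are still wrapped in another encoding.

```python
def filter_names(value):
    """Return the /Filter entry as a list of plain strings."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return [str(name) for name in value]
    return [str(value)]


# Filters whose raw stream is already a complete image file.
EXTENSIONS = {"/DCTDecode": ".jpg", "/JPXDecode": ".jp2"}


def passthrough_extension(value):
    """File extension if the raw stream can be written as-is, else None."""
    names = filter_names(value)
    if len(names) != 1:
        return None
    return EXTENSIONS.get(names[0])
```

With this, passthrough_extension(xObject[obj].get('/Filter')) gives ".jpg" for both '/DCTDecode' and ['/DCTDecode'].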

Hello @Modem Rakesh goud, could you please provide the PDF file that triggered this error? Thank you! — mxl, Oct 11 '19 at 14:16
Or would you eventually be in the possession of a program like Acrobat (not Reader, but the PRO version), or alternatively another PDF editing program which can extract a portion of the PDF and provide only that portion, or, just give me the `traceback.print_exc()` of the given error line, so that I can see what triggered it; or maybe opt for another of the solutions here on this site, as the one given here (to my understanding) is focused on providing a 1:1 lossless extraction of data from a PDF and may not be what you are looking for, thanks! — mxl, Oct 15 '19 at 14:02
not sure why, but `/XObject` doesn't exists in any page i'm trying to run it on — Ricky Levi, Jul 21 '22 at 11:44
@RickyLevi could you provide the file which causes this behavior, also, what are your versions of `python` and `PyPDF2` — mxl, Aug 07 '22 at 11:31

SuperNova · Answer 18 · 2018-08-10T08:20:58.860

1

You can use the pdfimages command on Ubuntu as well.

Install Poppler using the commands below.

sudo apt install poppler-utils

sudo apt-get install python-poppler

pdfimages -png file.pdf image

The files created are, for example, when the PDF contains two images:

image-000.png
image-001.png

It works! You can now use subprocess.run to run this from Python.

edited Aug 10 '18 at 08:20

answered Aug 08 '18 at 09:48

SuperNova

score 1 · Answer 19 · answered Apr 18 '20 at 10:40

Try the code below. It extracts the JPEG (/DCTDecode) images from the PDF.

    import sys
    import PyPDF2
    from PIL import Image
    pdf=sys.argv[1]
    print(pdf)
    input1 = PyPDF2.PdfFileReader(open(pdf, "rb"))
    for x in range(0,input1.numPages):
        xObject=input1.getPage(x)
        xObject = xObject['/Resources']['/XObject'].getObject()
        for obj in xObject:
            if xObject[obj]['/Subtype'] == '/Image':
                size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
                print(size)
                data = xObject[obj]._data
                #print(data)
                print(xObject[obj]['/Filter'])
                if xObject[obj]['/Filter'][0] == '/DCTDecode':
                    img_name=str(x)+".jpg"
                    print(img_name)
                    img = open(img_name, "wb")
                    img.write(data)
                    img.close()
        print(str(x)+" is done")

PyPDF2 now supports image extraction out of the box — Martin Thoma, Dec 17 '22 at 09:50

score 1 · Answer 20 · answered Feb 15 '22 at 07:33

I rewrote the solutions as a single Python class. It should be easy to work with. If you come across a new "/Filter" or "/ColorSpace", just add it to the internal dictionaries.

https://github.com/survtur/extract_images_from_pdf

Requirements:

Python3.6+
PyPDF2
PIL

answered Feb 15 '22 at 07:33

Alexander C

mara004 · Answer 21 · 2023-02-23T17:13:13.140

With pypdfium2 (v4):

import pypdfium2.__main__ as pdfium_cli

pdfium_cli.api_main(["extract-images", "input.pdf", "-o", "output_dir"])

There are some options to choose between different extraction strategies (see pypdfium2 extract-images --help).

Actual non-CLI Python APIs are available as well. The CLI's implementation demonstrates them (see the docs for details):

# assuming `args` is a given options set (e. g. argparse namespace)

import traceback
import pypdfium2 as pdfium
import pypdfium2.raw as pdfium_c

pdf = pdfium.PdfDocument(args.input)

images = []
for i in args.pages:
    page = pdf.get_page(i)
    obj_searcher = page.get_objects(
        filter = (pdfium_c.FPDF_PAGEOBJ_IMAGE, ),
        max_depth = args.max_depth,
    )
    images += list(obj_searcher)

n_digits = len(str(len(images)))

for i, image in enumerate(images):
    prefix = args.output_dir / ("%s_%0*d" % (args.input.stem, n_digits, i+1))
    
    try:
        if args.use_bitmap:
            pil_image = image.get_bitmap(render=args.render).to_pil()
            pil_image.save("%s.%s" % (prefix, args.format))
        else:
            image.extract(prefix, fb_format=args.format, fb_render=args.render)
    except pdfium.PdfiumError:
        traceback.print_exc()

Note: Unfortunately, PDFium's public image extraction APIs are quite limited, so PdfImage.extract() is not nearly as capable as pikepdf's extraction. If you only need the image bitmap and do not intend to save the image, PdfImage.get_bitmap() should be quite fine, though.

(Disclaimer: I'm the author of pypdfium2)

score 0 · Answer 22 · answered Jan 05 '23 at 06:23

The following code uses the updated PyMuPDF API:

import fitz  # PyMuPDF

doc = fitz.open("/Users/vignesh/Downloads/ViewJournal2244.pdf")
Images_per_page={}
for i in range(len(doc)):
    page=doc[i]
    images=[]
    for image_box in page.get_images():
        rect=page.get_image_rects(image_box)
        if not rect:
            continue  # image is listed in the page resources but never drawn
        # Renders the page area covered by the image; this is not the raw embedded stream
        pix=page.get_pixmap(matrix=fitz.Identity,clip=rect[0],colorspace=fitz.csRGB,alpha=True,annots=True)
        images.append(pix.tobytes())  # PNG-encoded bytes
    Images_per_page[i]=images

score 0 · Answer 23 · answered Jan 31 '23 at 14:06

This worked for me:

import PyPDF2
from PyPDF2 import PdfFileReader

# Open the PDF file
pdf_file = open(r"C:\Users\file.pdf", 'rb')
pdf_reader = PdfFileReader(pdf_file)

# Iterate through each page
for page_num in range(pdf_reader.numPages):
    page = pdf_reader.getPage(page_num)
    xObject = page['/Resources']['/XObject'].getObject()

    # Iterate through each image on the page
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            # You can now save the image data to a file
            # (the .jpg extension is only correct for /DCTDecode images)
            with open(f'C:\\Users\\filepath\\{obj[1:]}.jpg', 'wb') as img_file:
                img_file.write(data)

# Close the PDF file
pdf_file.close()

score -1 · Answer 24 · answered Nov 23 '20 at 11:14

First, install pdf2image:

pip install pdf2image==1.14.0

Then use the code below to convert the pages of the PDF to images.

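from pdf2image import convert_from_path, pdfinfo_from_path

file_path = "file path of PDF"
image_path = "output directory for the images"  # placeholder; the directory must already exist
info = pdfinfo_from_path(file_path, userpw=None, poppler_path=None)
maxPages = info["Pages"]
image_counter = 0
if maxPages > 10:
    # Convert in batches of 10 pages to limit memory use.
    # range() must run to maxPages + 1 so that a final page such as 11 or 21 is not skipped.
    for first_page in range(1, maxPages + 1, 10):
        pages = convert_from_path(file_path, dpi=300, first_page=first_page,
                last_page=min(first_page+10-1, maxPages))
        for page in pages:
            page.save(image_path+'/' + str(image_counter) + '.png', 'PNG')
            image_counter += 1
else:
    pages = convert_from_path(file_path, 300)
    for i, j in enumerate(pages):
        j.save(image_path+'/' + str(i) + '.png', 'PNG')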

Hope it helps coders looking for easy conversion of PDF files to Images as per pages of PDF.
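
The batching arithmetic is easy to get wrong by one page, so it may be worth isolating and checking on its own. The generator below is a sketch with a made-up name. It yields 1-based, inclusive page ranges of the kind that first_page and last_page expect, and the range runs to n_pages + 1 so that a final partial batch is included.

```python
def page_batches(n_pages, batch_size=10):
    """Yield (first_page, last_page) pairs, 1-based and inclusive."""
    for first in range(1, n_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, n_pages)


batches = list(page_batches(21))
```

Each pair can then be passed to convert_from_path as first_page and last_page.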

This will convert the PDF into images, but it does not extract the images from the remaining text. — user3072843, Jul 01 '21 at 13:14
