Extract images from PDF using python PyPDF2

Question

Is there any way to extract images as streams from a PDF document using the PyPDF2 library? Also, is it possible to replace some images with others (generated with PIL, for example, or loaded from a file)?

I'm able to get an EncodedStreamObject from the PDF object tree and read its stream data (by calling the getData() method), but it looks like just raw content without any image headers or other meta information.

>>> import PyPDF2
>>> # sample.pdf contains png images
>>> reader = PyPDF2.PdfFileReader(open('sample.pdf', 'rb'))
>>> reader.resolvedObjects[0][9]
{'/BitsPerComponent': 8,
'/ColorSpace': ['/ICCBased', IndirectObject(20, 0)],
'/Filter': '/FlateDecode',
'/Height': 30,
'/Subtype': '/Image',
'/Type': '/XObject',
'/Width': 100}
>>>
>>> reader.resolvedObjects[0][9].__class__
PyPDF2.generic.EncodedStreamObject
>>>
>>> s = reader.resolvedObjects[0][9].getData()
>>> len(s), s[:10]
(9000, '\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc')

I've looked across PyPDF2, ReportLab and PDFMiner solutions quite a bit, but haven't found anything like what I'm looking for.

Any code samples and links will be very helpful.

So you want to open a large pdf, extract a page(s), and add that page(s) to an existing pdf? Would it be ok to save that combined pdf as a new file? — ExperimentsWithCode, Mar 24 '14 at 14:59
This answer could help: http://stackoverflow.com/a/34116472/1513933 — Laurent LAPORTE, Nov 29 '16 at 22:41
Possible duplicate of [Extract images from PDF without resampling, in python?](http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python) — Florian Brucker, Apr 21 '17 at 15:48

jainam shah · Answer 1 · 2020-11-24T12:55:55.087

2

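import fitz  # PyMuPDF, installed with: pip install PyMuPDF

filePath = "sample.pdf"  # placeholder: path to the PDF to read
doc = fitz.open(filePath)
for i in range(len(doc)):
    # each entry describes one image on page i; its first item is the xref
    # note: newer PyMuPDF releases use snake_case names for these calls
    # (get_page_images instead of getPageImageList, Pixmap.save instead of writePNG)
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha < 4:   # GRAY or RGB (alpha channel not counted)
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:               # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None     # release the converted pixmap
        pix = None          # release the original pixmap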

edited Nov 24 '20 at 12:55

answered May 30 '19 at 13:08


5

Welcome to Stack Overflow! While this code snippet may solve the problem, it doesn't explain why or how it answers the question. Please [include an explanation for your code](//meta.stackexchange.com/q/114762/269535), as that really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. – Samuel Philipp May 30 '19 at 13:34
thanks @jainam shah it works for me. `pip install PyMuPDF` install this library and `import fitz` after it works. – Chandan Mar 03 '20 at 08:10

dataninsight · Answer 2 · 2021-11-25T03:25:24.237

Extracting Images from PDF

This code fetches any images in a scanned, machine-generated, or ordinary PDF.
It reports how many images occur on each page.
It saves each image at its original resolution and with its original extension.

# install the dependencies first: pip install PyMuPDF Pillow
import fitz  # PyMuPDF
import io
from PIL import Image

# path of the file you want to extract images from
file = r"File_path"
# open the file
pdf_file = fitz.open(file)
# iterate over the PDF pages
for page_index in range(pdf_file.page_count):
    # get the page itself
    page = pdf_file[page_index]
    image_li = page.get_images()
    # print the number of images found on this page
    # (page_index starts at 0, hence the +1 in the messages)
    if image_li:
        print(f"[+] Found a total of {len(image_li)} images in page {page_index+1}")
    else:
        print(f"[!] No images found on page {page_index+1}")
    for image_index, img in enumerate(image_li, start=1):
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes
        base_image = pdf_file.extract_image(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it into PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to local disk; PIL infers the format from the extension
        image.save(f"image{page_index+1}_{image_index}.{image_ext}")

score 1 · Answer 3 · answered Oct 13 '17 at 00:44

Image metadata is not stored within the encoded images of a PDF. If metadata is stored at all, it is stored in the PDF itself and stripped from the underlying image. The metadata you see in your example is likely all that you'll be able to get. PDF encoders may store image metadata elsewhere in the PDF, but I haven't seen this. (Note that this metadata question was also asked for Java.)

It's definitely possible to extract the stream, however: as you mentioned, you call the getData() method.

As for replacing it, you'll need to create a new image object within the PDF, add it to the end, and update the indirect object references accordingly. This will be difficult to do with PyPDF2.

score 0 · Answer 4 · answered Jun 18 '23 at 11:29

PyPDF2 has since been deprecated; use pypdf instead.

Extract images

Straight from the docs:

from pypdf import PdfReader

reader = PdfReader("example.pdf")

page = reader.pages[0]
count = 0

for image_file_object in page.images:
    with open(str(count) + image_file_object.name, "wb") as fp:
        fp.write(image_file_object.data)
        count += 1

Replace images

Will go into the docs soon: https://github.com/py-pdf/pypdf/pull/1894

from pypdf import PdfReader, PdfWriter
reader = PdfReader("example.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
for page in writer.pages:
    for img in page.images:
        img.replace(img.image, quality=80)
with open("out.pdf", "wb") as f:
    writer.write(f)
