Extract an image from a PDF in Python

Question

I'm trying to extract images from a PDF using PyPDF2, but the image my code produces looks very different from what it should. See the example below:

[image]

But this is how it should actually look:

[image]

Here's the pdf I'm using:

https://www.hbp.com/resources/SAMPLE%20PDF.pdf

Here's my code:

import PyPDF2

pdf_filename = "SAMPLE.pdf"
# Keep the PDF open while reading: PyPDF2 loads objects lazily from the file
with open(pdf_filename, 'rb') as pdf_file:
    cond_scan_reader = PyPDF2.PdfFileReader(pdf_file)
    page = cond_scan_reader.getPage(0)

    xObject = page['/Resources']['/XObject'].getObject()
    i = 0
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            # /DCTDecode streams hold JPEG data, so the raw bytes are written out unchanged
            if xObject[obj]['/Filter'] == '/DCTDecode':
                data = xObject[obj]._data
                with open("{}.jpg".format(i), "wb") as img:
                    img.write(data)
                i += 1

Since I need to keep the image in its colour mode, I can't just convert it to RGB if it is CMYK, because I need that information. Also, I'm trying to get the DPI of the images I extract from a PDF. Is that information always stored in the image? Thanks in advance.

pypdf has improved image extraction a lot. You might want to give `pypdf` (not `PyPDF2`!) another shot. — Martin Thoma, Jun 18 '23 at 12:44

score 1 · Answer 1 · answered Dec 11 '19 at 17:21

Hope this works. You probably need to use another library, such as Pillow.

Here is an example:


    from PIL import Image
    image = Image.open("path_to_image")
    if image.mode == 'CMYK':
        image = image.convert('RGB')
    # Pillow images are written with save(); there is no write() method
    image.save("path_to_image.jpg")

Reference: Convert from CMYK to RGB

score 1 · Accepted Answer · answered Dec 11 '19 at 20:03

I used pdfreader to extract the image from your example. The image uses ICCBased colorspace with the value of N=4 and Intent value of RelativeColorimetric. This means that the "closest" PDF colorspace is DeviceCMYK.
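For reference, when an ICCBased colorspace has no alternate and the profile cannot be used, the PDF specification falls back to a device colorspace chosen by the component count N. Here is a minimal sketch of that mapping. The names `DEVICE_SPACE_BY_N`, `PILLOW_MODE_BY_SPACE` and `closest_device_space` are made up for illustration and are not part of pdfreader or Pillow:

```python
# Fallback device colorspace for an ICCBased colorspace, keyed by component count N.
DEVICE_SPACE_BY_N = {1: "DeviceGray", 3: "DeviceRGB", 4: "DeviceCMYK"}

# Pillow mode strings that correspond to those device colorspaces.
PILLOW_MODE_BY_SPACE = {"DeviceGray": "L", "DeviceRGB": "RGB", "DeviceCMYK": "CMYK"}


def closest_device_space(n):
    """Return the device colorspace name for an ICCBased colorspace with N components."""
    try:
        return DEVICE_SPACE_BY_N[n]
    except KeyError:
        raise ValueError("ICCBased colorspaces use N = 1, 3 or 4, got {!r}".format(n))


space = closest_device_space(4)           # N=4, as reported for the sample image
print(space, PILLOW_MODE_BY_SPACE[space])
```

With N=4 this picks DeviceCMYK, which is why the extracted data has to be treated as CMYK rather than RGB.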

All you need is to convert the image to RGB and invert the colors.

Here is the code:

from pdfreader import SimplePDFViewer
import PIL.ImageOps

# Keep the file open while the viewer reads from it
with open("SAMPLE PDF.pdf", "rb") as fd:
    viewer = SimplePDFViewer(fd)

    # Render the first page so its images are collected on the canvas
    viewer.render()
    img = viewer.canvas.images['Im0']

    # this displays ICCBased 4 RelativeColorimetric
    print(img.ColorSpace[0], img.ColorSpace[1].N, img.Intent)

    pil_image = img.to_Pillow()
    pil_image = pil_image.convert("RGB")
    inverted = PIL.ImageOps.invert(pil_image)

    inverted.save("sample.png")

Read more on PDF objects: Image (sec. 8.9.5), InlineImage (sec. 8.9.7)
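On the DPI part of the question: a `/DCTDecode` stream is plain JPEG data, so you can inspect its headers without converting anything. Below is a rough sketch using only the standard library. It reads the component count from the frame header and the pixel density from the JFIF header, if there is one. The function name `inspect_jpeg` and the returned dictionary layout are my own choices, not a library API. The JFIF header is optional, so the density may simply be absent. Also, a PDF scales each image when it places it on the page, so the printed resolution depends on the placement as well as on the pixel size.

```python
import struct

# Start-of-frame markers. 0xC4, 0xC8 and 0xCC share the range but are not frame headers.
SOF_MARKERS = {0xC0, 0xC1, 0xC2, 0xC3, 0xC5, 0xC6, 0xC7,
               0xC9, 0xCA, 0xCB, 0xCD, 0xCE, 0xCF}
JFIF_UNITS = {0: "aspect ratio only", 1: "dpi", 2: "dots per cm"}


def inspect_jpeg(data):
    """Return component count, (width, height) and JFIF density found in JPEG bytes."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    info = {"components": None, "size": None, "density": None}
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break                      # lost marker sync, give up
        marker = data[pos + 1]
        if marker == 0xFF:             # fill byte before a marker
            pos += 1
            continue
        if marker == 0x01 or 0xD0 <= marker <= 0xD9:
            pos += 2                   # standalone marker without a length field
            continue
        if marker == 0xDA:
            break                      # start of scan: the header segments are over
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if length < 2:
            break                      # malformed segment
        body = data[pos + 4:pos + 2 + length]
        if marker == 0xE0 and body[:5] == b"JFIF\x00" and len(body) >= 12:
            units, xden, yden = struct.unpack(">BHH", body[7:12])
            info["density"] = {"units": JFIF_UNITS.get(units, "unknown"),
                               "x": xden, "y": yden}
        elif marker in SOF_MARKERS and len(body) >= 6:
            _precision, height, width, ncomp = struct.unpack(">BHHB", body[:6])
            info["components"] = ncomp
            info["size"] = (width, height)
        pos += 2 + length
    return info
```

In the question's loop this could be called as `inspect_jpeg(xObject[obj]._data)`. A component count of 4 indicates a four-channel image (CMYK or YCCK), 3 usually means YCbCr or RGB, and 1 means grayscale. If `density` comes back as `None`, the JPEG carries no JFIF density.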
