I am trying to read an image from a pdf following this post: Extract images from PDF without resampling, in python?

So far I have managed to extract the image file from the PDF, but it uses a CMYK color space and the picture comes out with the wrong colors.

My code is the following:

import PyPDF2

pdf_filename = 'document.pdf'
with open(pdf_filename, 'rb') as pdf_file:
    cond_scan_reader = PyPDF2.PdfFileReader(pdf_file)
    page = cond_scan_reader.getPage(4)
    xObject = page['/Resources']['/XObject'].getObject()
    for obj in xObject:
        print(xObject[obj])
        if xObject[obj]['/Subtype'] == '/Image':
            if xObject[obj]['/Filter'] == '/DCTDecode':
                # DCTDecode means the stream is JPEG data, so write it out unchanged
                data = xObject[obj]._data
                with open("image" + ".jpg", "wb") as img:
                    img.write(data)

The problem is that the saved image has wrong colors, and I believe the color space is the cause. The script prints the following to the console:

{'/Type': '/XObject', '/Subtype': '/Image', '/Width': 1122, '/Height': 502, '/Interpolate': <PyPDF2.generic.BooleanObject object at 0x1061574a8>, '/ColorSpace': '/DeviceCMYK', '/BitsPerComponent': 8, '/Filter': '/DCTDecode'}

As you can see, the ColorSpace is CMYK, and I believe that's why the colors of the image are weird.

That's the image I have:

Dolphin with weird colors

This is the original image (it is inside a pdf file):

Original

Can anyone help me?

Thanks in advance. Israel

Israel Zinc

1 Answer

A CMYK-mode JPEG image embedded in a PDF must be inverted after extraction.

However, PIL does not support inverting a CMYK-mode image, so I solved it with numpy by subtracting every byte from 255.

The full source is at the link below: https://github.com/Gaia3D/pdfImageExtractor/blob/master/extrectImage.py

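import numpy as np
from PIL import Image

# img is assumed to be the CMYK-mode PIL image opened from the extracted JPEG data
imgData = np.frombuffer(img.tobytes(), dtype='B')
invData = np.full(imgData.shape, 255, dtype='B')
invData -= imgData  # invert every channel sample: 255 - value
img = Image.frombytes(img.mode, img.size, invData.tobytes())
img.save(outFileName + ".jpg")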
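
For anyone who wants to try the same inversion without numpy, here is a minimal sketch of the idea using only a byte lookup table. The names invert_bytes and fix_cmyk_jpeg are hypothetical helpers of mine, not part of the linked source, and the wrapper assumes Pillow is installed. Whether the inversion is needed at all may depend on how the JPEG was written and on your Pillow version, so compare the result against the original.

```python
import io

# Lookup table mapping every byte value v to 255 - v.
_INVERT_TABLE = bytes(255 - v for v in range(256))


def invert_bytes(raw: bytes) -> bytes:
    """Return a copy of raw with every 8-bit sample inverted."""
    return raw.translate(_INVERT_TABLE)


def fix_cmyk_jpeg(data: bytes, out_path: str) -> None:
    """Hypothetical helper: invert an extracted CMYK JPEG and save it.

    Requires Pillow, imported lazily so invert_bytes works without it.
    """
    from PIL import Image

    img = Image.open(io.BytesIO(data))
    if img.mode == "CMYK":
        # Rebuild the image from the inverted raw samples, same mode and size.
        img = Image.frombytes(img.mode, img.size, invert_bytes(img.tobytes()))
    img.save(out_path)
```

With the question's code, this would be called as fix_cmyk_jpeg(data, "image.jpg") in place of writing data straight to disk. Images that are not in CMYK mode are saved without modification.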