I am trying to read an image from a pdf following this post: Extract images from PDF without resampling, in python?
So far I have managed to extract the image file from the PDF, but it uses a CMYK color space and the saved picture comes out with the wrong colors.
My code is the following:
import PyPDF2
import struct

pdf_filename = 'document.pdf'
pdf_file = open(pdf_filename, 'rb')
cond_scan_reader = PyPDF2.PdfFileReader(pdf_file)
page = cond_scan_reader.getPage(4)
xObject = page['/Resources']['/XObject'].getObject()

for obj in xObject:
    print(xObject[obj])
    if xObject[obj]['/Subtype'] == '/Image':
        if xObject[obj]['/Filter'] == '/DCTDecode':
            # DCTDecode streams are plain JPEG data, so write the raw bytes as-is
            data = xObject[obj]._data
            img = open("image" + ".jpg", "wb")
            img.write(data)
            img.close()

pdf_file.close()
The problem is that the saved image has the wrong colors, and I believe this is caused by the color space. The script prints the following to the console:
{'/Type': '/XObject', '/Subtype': '/Image', '/Width': 1122, '/Height': 502, '/Interpolate': <PyPDF2.generic.BooleanObject object at 0x1061574a8>, '/ColorSpace': '/DeviceCMYK', '/BitsPerComponent': 8, '/Filter': '/DCTDecode'}
As you can see, the ColorSpace is CMYK, and I believe that's why the colors of the image are weird.
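To check that suspicion on the extracted bytes themselves, here is a minimal sketch that uses only the standard library. It walks the JPEG header segments and reports the number of color components plus the transform flag of an Adobe APP14 segment, if one is present. The function name `jpeg_color_info` is my own and not part of PyPDF2. My assumption is that 4 components together with an Adobe segment point to a CMYK JPEG whose samples may be stored inverted, which some viewers display with wrong colors.

```python
import struct

# Start-of-frame markers (0xC0-0xCF, excluding DHT 0xC4, JPG 0xC8 and DAC 0xCC)
SOF_MARKERS = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}
# Markers that carry no length field
STANDALONE_MARKERS = {0x01} | set(range(0xD0, 0xD8))


def jpeg_color_info(data):
    """Return (num_components, adobe_transform) for raw JPEG bytes.

    Either value is None when the matching segment is not found
    before the compressed scan data starts.
    """
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    components = None
    adobe_transform = None
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break  # lost marker sync; stop rather than guess
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        if marker in STANDALONE_MARKERS:
            pos += 2
            continue
        if marker in (0xD9, 0xDA):  # EOI, or SOS (scan data follows)
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segment = data[pos + 4:pos + 2 + length]
        if marker == 0xEE and segment[:5] == b"Adobe" and len(segment) >= 12:
            # "Adobe", version (2), flags0 (2), flags1 (2), transform (1)
            adobe_transform = segment[11]
        elif marker in SOF_MARKERS and len(segment) >= 6:
            # precision (1), height (2), width (2), number of components (1)
            components = segment[5]
        pos += 2 + length
    return components, adobe_transform
```

It could be called as `jpeg_color_info(data)` on the `data` bytes from my script, just before they are written to disk. A first value of 4 would confirm that the JPEG really has four channels rather than the three that an RGB or YCbCr image would have.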
This is the image I get:
This is the original image (it is inside a pdf file):
Can anyone help me?
Thanks in advance. Israel