Using the code I found in this post, which corrects this example code, I'm trying to extract the images from every page of a PDF file. It works for JPG images, but for PNG images I get an error on the second line of this snippet (the `Image.frombytes` call):

if xObject[obj]['/Filter'] == '/FlateDecode':
    img = Image.frombytes(mode, size, data)
    img.save(imagename + ".png")
    number += 1

This yields ValueError: not enough image data, which seems to occur because data cannot be correctly decoded.

mrgou

1 Answer

The code is incorrect because PDF files do not embed complete PNG images (as they do with JPEG). An image with the FlateDecode filter contains only the raw image data, compressed with the Flate method.

You have to decompress the data to get the raw image data, convert it to RGB (based on the color space defined on the PDF image object), and then use the other properties defined on the PDF image object (Width, Height, etc.) to construct a PNG image.
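As a rough illustration of those steps (not the original poster's code), here is a minimal sketch for the simplest case only: an 8-bit `/DeviceRGB` image with no predictor and no mask. The helper names `expected_size`, `decode_flate_rgb` and `save_png` are hypothetical. Checking the decompressed length against Width, Height and BitsPerComponent first gives a clearer error than `not enough image data`.

```python
import zlib


def expected_size(width, height, components, bits_per_component):
    """Byte length of the raw samples; each row is padded to a whole byte."""
    row_bytes = (width * components * bits_per_component + 7) // 8
    return row_bytes * height


def decode_flate_rgb(data, width, height):
    """Inflate a FlateDecode stream assumed to hold 8-bit DeviceRGB samples."""
    raw = zlib.decompress(data)
    want = expected_size(width, height, components=3, bits_per_component=8)
    if len(raw) != want:
        # Usually means another color space, bit depth, or a predictor is in use.
        raise ValueError(f"expected {want} bytes, got {len(raw)}")
    return raw


def save_png(raw, width, height, path):
    """Wrap the raw RGB samples in a PNG file using Pillow."""
    from PIL import Image  # imported here so the helpers above need only the stdlib

    Image.frombytes("RGB", (width, height), raw).save(path)
```

If the length check fails, the image most likely uses a different color space or bit depth, or the stream has a predictor applied, and the raw bytes cannot be passed to `Image.frombytes` as they are.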

iPDFdev
  • Now, that's interesting, as the original code has been out here for a while now! Replacing the line with `img = Image.frombytes(mode, size, zlib.decompress(data))` indeed removes the exception. However, the images are all black, so I guess the color space still needs work. Any suggestion? – mrgou Apr 04 '22 at 16:59
  • @mrgou I'm not familiar with Python and its PDF libraries but by looking at the code I can say that `img = Image.frombytes(mode, size, zlib.decompress(data))` works only for 24bit RGB images (`/ColorSpace /DeviceRGB /BitsPerComponent 8` in the PDF file). You have to handle the remaining 9 color spaces supported in PDF and the 1/2/4/16 additional values for /BitsPerComponent. Next you need to check for a mask and apply it in order to get the image you see on the PDF page. If you have a sample PDF file, post a link and I'll tell you the exact image details. – iPDFdev Apr 05 '22 at 13:40
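One way to act on the comment above is to pick the Pillow mode from `/ColorSpace` and `/BitsPerComponent` instead of assuming RGB. The sketch below is only a partial mapping for a few simple device color spaces; the table `PILLOW_MODES` and the function `pillow_mode` are hypothetical names, and it assumes the color space arrives as a plain name string. Array-valued color spaces (for example Indexed or ICCBased), other bit depths, `/Decode` arrays and masks are deliberately left unhandled.

```python
# Partial mapping from (color space name, bits per component) to a Pillow mode.
PILLOW_MODES = {
    ("/DeviceGray", 1): "1",     # bilevel, 8 pixels packed per byte
    ("/DeviceGray", 8): "L",     # 8-bit grayscale
    ("/DeviceRGB", 8): "RGB",    # the case the original snippet assumes
    ("/DeviceCMYK", 8): "CMYK",  # convert to RGB before saving as PNG
}


def pillow_mode(color_space, bits_per_component):
    """Return a Pillow mode for the raw samples, or None if unsupported here."""
    if not isinstance(color_space, str):
        # Array-valued color spaces need their own handling.
        return None
    return PILLOW_MODES.get((color_space, bits_per_component))
```

A `None` result signals an image that needs extra work (palette lookup, ICC profile, bit unpacking) before `Image.frombytes` can be used.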