
Yes, I hate myself for asking a pretty simple question.

I was hoping to get some advice on the best Python library for extracting images (of varying types) from a PDF.

I'm trying to take a PDF drawing, save an image from it along with its position on the page, then place the saved image at the same position on a set of other PDFs.

I have tried a few so far but got stuck on various errors, and the research I've done suggests there is no clear and obvious choice.

I tried PyPDF2 but got an error about PNG filter 3 being unsupported. I tried PDFMiner, but it is limited to JPEGs; that isn't a deal breaker, but I still couldn't get it to extract a JPEG. I also tried the fitz module from PyMuPDF and got 1 of the 3 images in my PDF, but it came out colour-inverted, mirrored, and upside down, though I'm sure post-processing could fix that.
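
For background on the PyPDF2 error: "PNG filter 3" most likely refers to the PNG "Average" row filter, which a PDF stream can use as a predictor before /FlateDecode compression. That is an assumption about the cause, not something confirmed from the traceback. The sketch below shows how the five PNG row filters are reversed on data that has already been inflated. The helper names `paeth` and `undo_png_filters` are hypothetical, and this is an illustration of the technique, not PyPDF2's own code.

```python
def paeth(a: int, b: int, c: int) -> int:
    """Paeth predictor from the PNG spec: pick whichever of a, b, c is closest to a + b - c."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c


def undo_png_filters(data: bytes, row_bytes: int, bpp: int) -> bytes:
    """Reverse PNG row filters 0-4 on already-inflated data.

    row_bytes: bytes per row, excluding the leading filter-type byte.
    bpp: bytes per complete pixel (at least 1).
    """
    stride = row_bytes + 1  # each row starts with one filter-type byte
    if len(data) % stride:
        raise ValueError("data length is not a whole number of rows")

    out = bytearray()
    prev = bytearray(row_bytes)  # the row above the first row counts as all zeros
    for start in range(0, len(data), stride):
        ftype = data[start]
        row = bytearray(data[start + 1:start + stride])
        for i in range(row_bytes):
            a = row[i - bpp] if i >= bpp else 0   # byte to the left (already reconstructed)
            b = prev[i]                           # byte above
            c = prev[i - bpp] if i >= bpp else 0  # byte above-left
            if ftype == 0:      # None
                pred = 0
            elif ftype == 1:    # Sub
                pred = a
            elif ftype == 2:    # Up
                pred = b
            elif ftype == 3:    # Average, the filter named in the error
                pred = (a + b) // 2
            elif ftype == 4:    # Paeth
                pred = paeth(a, b, c)
            else:
                raise ValueError(f"unknown PNG filter type {ftype}")
            row[i] = (row[i] + pred) & 0xFF
        out += row
        prev = row
    return bytes(out)
```

A library that handles only filter types 0 to 2 would fail on any stream whose encoder chose the Average or Paeth filter for some row, which would match the error described above.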

The code I have used, to be honest, consists of examples that people far smarter than me came up with, which I have modified as necessary.

Fitz below

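import fitz  # PyMuPDF

pdf = "drawing.pdf"  # hypothetical path to the source PDF

# Note: newer PyMuPDF releases rename getPageImageList() to get_page_images()
# and Pixmap.writePNG() to Pixmap.save().
doc = fitz.open(pdf)
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]                   # cross-reference number of the image
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5:       # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:               # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None          # release the pixmap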

PyPDF2 below

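import PyPDF2              # uses the PyPDF2 API from before version 3.0
from PIL import Image      # Pillow, used to rebuild raw pixel data

pdf = "drawing.pdf"  # hypothetical path to the source PDF

input1 = PyPDF2.PdfFileReader(pdf)
page0 = input1.getPage(0)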

if '/XObject' in page0['/Resources']:
    xObject = page0['/Resources']['/XObject'].getObject()

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"

            if '/Filter' in xObject[obj]:
                if xObject[obj]['/Filter'] == '/FlateDecode':
                    img = Image.frombytes(mode, size, data)
                    img.save(obj[1:] + ".png")
                elif xObject[obj]['/Filter'] == '/DCTDecode':
                    img = open(obj[1:] + ".jpg", "wb")
                    img.write(data)
                    img.close()
                elif xObject[obj]['/Filter'] == '/JPXDecode':
                    img = open(obj[1:] + ".jp2", "wb")
                    img.write(data)
                    img.close()
                elif xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                    img = open(obj[1:] + ".tiff", "wb")
                    img.write(data)
                    img.close()
            else:
                img = Image.frombytes(mode, size, data)
                img.save(obj[1:] + ".png")

If you're reading this and you wrote either of the above, thanks for getting me this far haha.

I'm mostly looking for advice on which library is best to proceed with, rather than for someone to hold my hand with the code.

I'd appreciate any wisdom you can impart.

Pete

Martin Thoma
p_smithuk

1 Answer


pypdf can (now) do this. Straight from the docs:

from pypdf import PdfReader

reader = PdfReader("example.pdf")

page = reader.pages[0]
count = 0

for image_file_object in page.images:
    with open(str(count) + image_file_object.name, "wb") as fp:
        fp.write(image_file_object.data)
        count += 1
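
The snippet above handles only the first page. A minimal sketch of extending it to every page follows. It uses only the `reader.pages`, `page.images`, `.name` and `.data` attributes shown above. The helper names `image_filename` and `extract_all_images`, the `images` output folder and the file-naming scheme are all hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def image_filename(page_index: int, image_index: int, name: str) -> str:
    """Build a per-page, per-image file name (hypothetical naming scheme)."""
    # Path(name).name drops any directory part, so the file stays inside out_dir.
    return f"p{page_index}-{image_index}-{Path(name).name}"


def extract_all_images(pdf_path: str, out_dir: str = "images") -> list[str]:
    """Write every image on every page to out_dir and return the written paths."""
    from pypdf import PdfReader  # third-party; imported here so the helper above has no dependency

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    reader = PdfReader(pdf_path)
    for page_index, page in enumerate(reader.pages):
        for image_index, image_file_object in enumerate(page.images):
            target = out / image_filename(page_index, image_index, image_file_object.name)
            target.write_bytes(image_file_object.data)
            written.append(str(target))
    return written
```

Calling `extract_all_images("example.pdf")` would then write one file per image. Note that this covers extraction only; it does not recover where each image sits on the page.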
Martin Thoma