Error while extracting images from a PDF in Python

Question

I am trying to extract images of all formats from a PDF. I did some googling and found this page on Stack Overflow. I tried the code from it, but I am getting this error:

I am using Python 3.x, and the code I am using is below. I went through the comments but couldn't figure it out. Please help me resolve this.

Here is the sample PDF.

import PyPDF2

from PIL import Image

if __name__ == '__main__':
    input1 = PyPDF2.PdfFileReader(open("Aadhaar1.pdf", "rb"))
    page0 = input1.getPage(0)
    xObject = page0['/Resources']['/XObject'].getObject()

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"

            if xObject[obj]['/Filter'] == '/FlateDecode':
                img = Image.frombytes(mode, size, data)
                img.save(obj[1:] + ".png")
            elif xObject[obj]['/Filter'] == '/DCTDecode':
                img = open(obj[1:] + ".jpg", "wb")
                img.write(data)
                img.close()
            elif xObject[obj]['/Filter'] == '/JPXDecode':
                img = open(obj[1:] + ".jp2", "wb")
                img.write(data)
                img.close()

I was reading some comments and going through links and found this problem solved on this page. Can someone please help me implement it?

Can you provide the input PDF as well? It's much easier to help if we can reproduce the issue you're having with the code and files you're using. – GaryMBloom Dec 09 '17 at 17:38
@Gary02127 Sorry for the late reply, Gary. The network in my location was down. I have tried multiple PDFs and get the same error. I have now edited the question to include a sample PDF. – john Dec 10 '17 at 04:29
It seems like the filter the PDF is using for images is not supported by the `PyPDF2` library you are using. I am not aware of any other PDF readers which do include this filter, but they might well be out there; I am not an expert. – physicalattraction Dec 10 '17 at 09:05
@physicalattraction You really gave a good idea of what's going wrong. There is actually a solution on the page I mentioned, but it redirects me to GitHub, and I don't know how to install a library from GitHub. Can you help? – john Dec 10 '17 at 09:26
I got this resolved. Thanks, everyone. You all gave me a better picture. – john Dec 13 '17 at 05:19

score 1 · Accepted Answer · edited Dec 13 '17 at 10:49

This is a bug in the PyPDF2 library. Try uninstalling the library and installing it again with the fix included, or look at the changes on GitHub and apply them to your installed copy yourself. I hope that works.

edited Dec 13 '17 at 10:49

john

answered Dec 13 '17 at 10:31

Sarwar Hayatt

This doesn't seem to be resolved for all kinds of PDF files yet. I'm still getting this error even after reinstalling the library using pip3. Have the GitHub changes been pushed to the PyPI package repository? – Prahlad Yeri Mar 11 '18 at 08:51
Here is the sample PDF where the library is throwing this error: http://www.hbp.com/resources/SAMPLE%20PDF.pdf – Prahlad Yeri Mar 11 '18 at 08:53
No, the file isn't merged yet. However I have merged that file for you, you can download and unzip it. Please install manually from the file here. https://1drv.ms/u/s!AnPgw7hXtChUhl3efS-Ty1cn74O8 – Sarwar Hayatt Mar 11 '18 at 12:48
Hi, so to solve the problem the library must be installed manually from the master branch on GitHub? Installing it with `pip install PyPDF2` won't work? – crash Apr 09 '19 at 14:17
Right, the suggested edits have not been merged yet, I believe. – Hayat Apr 10 '19 at 16:12

score 0 · Answer 2 · answered Nov 07 '19 at 09:43

As of today, I'm still getting the error NotImplementedError: unsupported filter /DCTDecode

I have PyPDF2 v1.26.0 installed and am using Python 3.7.5. My Python code is the same as above.

Is there a solution yet?

answered Nov 07 '19 at 09:43

Friso

score 0 · Answer 3 · answered Sep 06 '21 at 19:31

Same error for me with Python 3.9 and PyPDF2 1.26 at the time of this writing.

data = xObject[obj].getData()

was the problem line. My PDF had JPG images, and that line failed with the same NotImplementedError exception. Changing the line in the /DCTDecode branch to:

data = xObject[obj]._data

kind of worked for me. This gives the plain JPG stream as it is stored in the PDF. In other words, use a separate data = ... line in each if/filter branch. I have not tried the JP2 part, though.
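
For anyone who wants to see the "separate data line per filter" idea in one place, here is a rough sketch. It assumes the image object behaves like a PyPDF2 1.26 stream object: dictionary-style access, a getData() method, and the private _data attribute (not a public API, so it may change between versions). The helper name image_bytes_and_extension is made up for illustration, and it assumes /Filter is a single name rather than an array of filters.

```python
# Filters whose raw stream bytes are already a complete image file.
PASSTHROUGH_FILTERS = {"/DCTDecode": ".jpg", "/JPXDecode": ".jp2"}


def image_bytes_and_extension(image_obj):
    """Return (data, extension) for a PDF image XObject.

    Hypothetical helper. image_obj is assumed to look like a PyPDF2 1.26
    stream object: image_obj["/Filter"], image_obj.getData(), image_obj._data.
    An extension of None means the data are decoded pixels, not a file.
    """
    pdf_filter = image_obj["/Filter"]

    if pdf_filter in PASSTHROUGH_FILTERS:
        # Do not call getData() here: in PyPDF2 1.26 it raises
        # NotImplementedError for these filters. The raw stream is
        # already a JPEG / JPEG 2000 file, so return it untouched.
        return image_obj._data, PASSTHROUGH_FILTERS[pdf_filter]

    if pdf_filter == "/FlateDecode":
        # getData() decompresses the stream into raw pixel bytes,
        # which still need Image.frombytes(mode, size, data).
        return image_obj.getData(), None

    raise ValueError("unhandled image filter: %s" % pdf_filter)
```

In the loop from the question this would take the place of the single data = xObject[obj].getData() call. When the returned extension is None, the bytes still have to go through Image.frombytes as before; otherwise they can be written straight to a file with that extension.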
