Can't execute the following script successfully

Question

I've written a Python script that uses PyPDF2, PIL and pytesseract to extract the text from the first page of a scanned PDF file. However, when I run the script below, it throws the following error when it reaches the line img = Image.open(pdfReader.getPage(0)).convert('L').

Script I have tried so far:

import PyPDF2
import pytesseract
from PIL import Image

pdfFileObj = open(r'C:\Users\WCS\Desktop\Scan project\Scanned.pdf', 'rb')

pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
img = Image.open(pdfReader.getPage(0)).convert('L')
imagetext = pytesseract.image_to_string(img)
print(imagetext)
pdfFileObj.close()

Error I'm having:

Traceback (most recent call last):
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\SO.py", line 8, in <module>
    img = Image.open(pdfReader.getPage(0)).convert('L')
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\lib\site-packages\PIL\Image.py", line 2554, in open
    fp = io.BytesIO(fp.read())
AttributeError: 'PageObject' object has no attribute 'read'

How can I make it run successfully?

I'm using Windows 7, @Tarun Lalwani. Your solutions hardly ever fail, so I'm very hopeful now. By the way, do I need to install anything other than `pdf2image` for your suggested script to work? Plus one in advance. — SIM, Jun 27 '18 at 16:37
I have just integrated this code from that answer, so whatever that answer lists, probably just that — Tarun Lalwani, Jun 27 '18 at 16:40

score 6 · Accepted Answer · answered Jun 27 '18 at 15:48

You need to convert the PDF page to an image first and then run OCR on that image. The code below is adapted from this answer:

Python: Extract a page from a pdf as a jpeg

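import pytesseract
from PIL import Image
from pdf2image import convert_from_path

# Path to the scanned PDF (a plain string, so there is nothing to close later)
pdf_path = r'C:\Users\WCS\Desktop\Scan project\Scanned.pdf'
# Render every page of the PDF to a PIL image at 500 DPI
pages = convert_from_path(pdf_path, 500)

# Save the first page as a PNG file
page = pages[0]
page.save('out.png')

# Reopen the PNG in grayscale and run OCR on it
img = Image.open('out.png').convert('L')
imagetext = pytesseract.image_to_string(img)
print(imagetext)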
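
The round trip through out.png is optional, because convert_from_path already returns PIL images. Below is a minimal sketch of that shortcut, not a definitive implementation. The function name ocr_first_page and its render and recognize hooks are my own illustration and are not part of pdf2image or pytesseract; they only exist so the function can be tried without the real tools. It assumes pdf2image can find the Poppler utilities and pytesseract can find the Tesseract binary, both of which have to be installed separately from the Python packages.

```python
def ocr_first_page(pdf_path, dpi=300, render=None, recognize=None):
    """Return the OCR text of the first page of a scanned PDF.

    render and recognize are hypothetical hooks for this sketch. When they
    are left as None, the real pdf2image and pytesseract functions are used.
    """
    if render is None:
        # pdf2image needs the Poppler utilities installed on the system
        from pdf2image import convert_from_path
        render = convert_from_path
    if recognize is None:
        # pytesseract needs the Tesseract OCR binary installed on the system
        import pytesseract
        recognize = pytesseract.image_to_string

    # Render only page 1 instead of the whole document
    pages = render(pdf_path, dpi=dpi, first_page=1, last_page=1)
    if not pages:
        raise ValueError("no pages could be rendered from " + repr(pdf_path))

    # Convert to grayscale in memory and hand the image straight to OCR
    return recognize(pages[0].convert('L'))
```

With everything installed, print(ocr_first_page(r'C:\Users\WCS\Desktop\Scan project\Scanned.pdf')) should behave like the script above without writing a file. Limiting the render to first_page=1 and last_page=1 also avoids rasterizing pages that are never read, which matters at high DPI.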

score 3 · Answer 2 · answered Jun 24 '18 at 17:27

Your problem is that you are asking PIL to read a PageObject defined by PyPDF2, which it cannot do. You should convert the PDF to an image format first, then use PIL to read it. For that, wand is probably what you need; see its home page. Here is a sample that saves every page of a PDF as a JPG:

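from wand.image import Image as WImage

# your_pdf_path is a placeholder for the path to your PDF
with WImage(filename=your_pdf_path, resolution=(300, 300)) as imgs:
    imgs.format = 'jpg'
    # enumerate gives each page its own index, so files are not overwritten
    for page_idx, img in enumerate(imgs.sequence):
        # save() needs the filename keyword when given a path string
        WImage(image=img).save(filename=str(page_idx) + '.jpg')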

From there, check the wand API mentioned above to adapt this to your problem.

KALALEX · Answer 3 · 2018-07-03T14:56:41.863

You cannot read the result of:

pdfReader.getPage(0)

because it is not an image. From the documentation:

getPage(pageNumber)

Retrieves a page by number from this PDF file.

Parameters: pageNumber (int) – The page number to retrieve (pages begin at zero)

Returns: a PageObject instance.

Return type: PageObject

So, to work with it, you need to read that class's documentation:

extractText()

Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

Returns: a unicode string object.

PageObject Doc

Summing up

import PyPDF2

pdfFileObj = open(r'C:\Users\WCS\Desktop\Scan project\Scanned.pdf', 'rb')

pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# extractText() returns the page's text as a string
txt = pdfReader.getPage(0).extractText()

pdfFileObj.close()

print(txt)

If you really need an image, look at @TarunLalwani's response, which is more accurate.
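
One caveat: extractText() only reads text drawing commands, so for a purely scanned page (an image with no text layer) it will often return an empty or whitespace-only string. Here is a small sketch of how you might detect that and decide to fall back to OCR. The helper name has_text_layer and the min_chars threshold are my own assumptions for illustration, not part of PyPDF2.

```python
def has_text_layer(extracted, min_chars=10):
    """Heuristic: True if the extracted text looks like real page content.

    extracted is the string returned by extractText(). min_chars is an
    arbitrary threshold chosen for this sketch; tune it for your documents.
    """
    # Drop all whitespace so blank lines and stray spaces do not count
    visible = "".join(extracted.split())
    return len(visible) >= min_chars
```

If has_text_layer(txt) is False for the page, the text most likely lives only in the scanned image, and an OCR route is the one to use.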

score 1 · Answer 4 · edited Jul 01 '18 at 20:57

img = Image.open(pdfReader.getPage(0), 'r').convert('L')

answered Jun 24 '18 at 13:36 by M. Fanfa · edited Jul 01 '18 at 20:57 by Jeru Luke

Your suggested portion makes it even worse. This is the error I'm having now `Traceback (most recent call last): File "some address", line 2552, in open fp.seek(0) AttributeError: 'PageObject' object has no attribute 'seek' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "some address", line 8, in img = Image.open(pdfReader.getPage(0), 'r').convert('L') File "some address", line 2554, in open fp = io.BytesIO(fp.read()) AttributeError: 'PageObject' object has no attribute 'read'`. – SIM Jun 24 '18 at 14:03
