1

I've written a script using Python with PyPDF2, PIL and pytesseract to extract the text from the first page of a scanned PDF file. However, when I run the script below, it throws the following error when it reaches the line img = Image.open(pdfReader.getPage(0)).convert('L').

Script I have tried so far:

import PyPDF2
import pytesseract
from PIL import Image

pdfFileObj = open(r'C:\Users\WCS\Desktop\Scan project\Scanned.pdf', 'rb')

pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
img = Image.open(pdfReader.getPage(0)).convert('L')
imagetext = pytesseract.image_to_string(img)
print(imagetext)
pdfFileObj.close()

Error I'm having:

Traceback (most recent call last):
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\SO.py", line 8, in <module>
    img = Image.open(pdfReader.getPage(0)).convert('L')
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\lib\site-packages\PIL\Image.py", line 2554, in open
    fp = io.BytesIO(fp.read())
AttributeError: 'PageObject' object has no attribute 'read'

How can I make this work?

SIM
  • Which OS are you using? Macos? @Topto – Tarun Lalwani Jun 27 '18 at 15:44
  • I'm using Windows 7 @Tarun Lalwani. Your solutions rarely fail, so I'm very hopeful now. Btw, do I need to install anything other than `pdf2image` for your suggested script to work? Plus one in advance.. – SIM Jun 27 '18 at 16:37
  • I have just integrated this code from that answer, so whatever that answer lists, probably just that – Tarun Lalwani Jun 27 '18 at 16:40
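On the question of what else needs installing: as far as I know, pdf2image shells out to Poppler's command-line tools (pdftoppm, pdfinfo) and pytesseract calls the tesseract executable, so both have to be installed separately from the pip packages. A minimal sketch for checking whether they are on PATH; the helper name missing_tools is made up for illustration:

```python
import shutil


def missing_tools(names=("pdftoppm", "pdfinfo", "tesseract")):
    """Return the executables from `names` that cannot be found on PATH.

    `missing_tools` is a hypothetical helper, not part of pdf2image or pytesseract.
    """
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Poppler and Tesseract executables were found on PATH.")
```

If the tools are installed but not on PATH, pdf2image's convert_from_path takes a poppler_path argument and pytesseract lets you set pytesseract.pytesseract.tesseract_cmd to the full path of the executable.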

4 Answers

6

You need to convert the PDF page to an image first and then run OCR on that image.

Python: Extract a page from a pdf as a jpeg

import pytesseract
from PIL import Image
from pdf2image import convert_from_path

pdf_path = r'C:\Users\WCS\Desktop\Scan project\Scanned.pdf'

# Render every page of the PDF to a PIL image at 500 DPI
pages = convert_from_path(pdf_path, 500)

# Save the first page, then reopen it as a grayscale image
page = pages[0]
page.save('out.png')

img = Image.open('out.png').convert('L')
imagetext = pytesseract.image_to_string(img)
print(imagetext)
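Since convert_from_path already returns PIL images, the round trip through out.png is optional. Here is a rough sketch that renders only the first page and hands it straight to pytesseract. The function name ocr_first_page and its render/ocr parameters are my own additions (they only exist so the function can be tried without Poppler or Tesseract installed), and the default of 300 DPI is a guess, not a recommendation from either library:

```python
def ocr_first_page(pdf_path, dpi=300, render=None, ocr=None):
    """OCR the first page of a scanned PDF and return the recognised text.

    `render` and `ocr` are hypothetical hooks; by default they resolve to
    pdf2image.convert_from_path and pytesseract.image_to_string.
    """
    if render is None:
        from pdf2image import convert_from_path  # needs Poppler installed
        render = convert_from_path
    if ocr is None:
        import pytesseract  # needs the tesseract executable installed
        ocr = pytesseract.image_to_string

    # Ask for page 1 only, so a long PDF is not rendered in full
    pages = render(pdf_path, dpi, first_page=1, last_page=1)

    # Convert to grayscale, as in the original script, then run OCR
    return ocr(pages[0].convert('L'))


# Example call (the path is the one from the question):
# print(ocr_first_page(r'C:\Users\WCS\Desktop\Scan project\Scanned.pdf'))
```

Limiting the render to one page mostly matters for large scans, where rendering every page at a high DPI can use a lot of memory.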
Tarun Lalwani
3

Your problem is that you are asking PIL to read a PageObject defined by PyPDF2, which it cannot do. You should convert the PDF to an image format first, then use PIL to read it. For that, wand is probably what you need; see its home page. Here is a sample that saves every page of a PDF as a JPG:

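from wand.image import Image as WImage

your_pdf_path = 'Scanned.pdf'  # placeholder: set this to the path of your PDF
with WImage(filename=your_pdf_path, resolution=(300,300)) as imgs:
    # Each item in imgs.sequence is one page of the PDF
    for page_idx, img in enumerate(imgs.sequence):
        with WImage(image=img) as page:
            page.format = 'jpeg'
            page.save(filename=str(page_idx)+'.jpg')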

From here, check the API mentioned above to adapt this to your problem.

AssKicker
3

You cannot call read() on:

pdfReader.getPage(0)

because it is not an image. From the documentation:


getPage(pageNumber)

Retrieves a page by number from this PDF file.

Parameters: pageNumber (int) – The page number to retrieve (pages begin at zero)

Returns: a PageObject instance.

Return type: PageObject


So, in order to do anything with it, you need to read that class's documentation:

extractText()

Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

Returns: a unicode string object.

PageObject Doc


Summing up

import PyPDF2

pdfFileObj = open(r'C:\Users\WCS\Desktop\Scan project\Scanned.pdf', 'rb')

pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# extractText() returns only the text stored in the page's content stream
txt = (pdfReader.getPage(0)).extractText()

pdfFileObj.close()

print(txt)

If you really need an image, look at @TarunLalwani's answer, which is more accurate.
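One caveat, stated as an assumption about the input: if the page is a pure scan with no text layer, extractText() will most likely return an empty or near-empty string, because there are no text drawing commands to locate. A small sketch of how one might decide whether to fall back to OCR; needs_ocr and its min_chars threshold are made up for illustration and would need tuning for real documents:

```python
def needs_ocr(extracted_text, min_chars=20):
    """Guess whether a page has no usable text layer.

    Counts non-whitespace characters in the extractText() result and
    compares them against `min_chars`, an arbitrary illustrative threshold.
    """
    visible = ''.join(extracted_text.split())
    return len(visible) < min_chars


# Typical use with the snippet above:
# if needs_ocr(txt):
#     ...render the page to an image and run pytesseract on it instead...
```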

KALALEX
1

img = Image.open(pdfReader.getPage(0), 'r').convert('L')

Jeru Luke
M. Fanfa
  • Your suggested portion makes it even worse. This is the error I'm having now `Traceback (most recent call last): File "some address", line 2552, in open fp.seek(0) AttributeError: 'PageObject' object has no attribute 'seek' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "some address", line 8, in img = Image.open(pdfReader.getPage(0), 'r').convert('L') File "some address", line 2554, in open fp = io.BytesIO(fp.read()) AttributeError: 'PageObject' object has no attribute 'read'`. – SIM Jun 24 '18 at 14:03
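Both tracebacks in this thread point at the same thing: Pillow's Image.open expects a filename, a path object, or a binary file object, and it calls seek() and read() on whatever else it is given. A PageObject offers neither, whatever mode argument is passed. A rough illustration using only the standard library; FakePage is a stand-in I made up so the snippet runs without PyPDF2, and can_be_opened_by_pillow is only an approximation of Pillow's behaviour, not its actual check:

```python
import io
import os


class FakePage:
    """Hypothetical stand-in for a PyPDF2 PageObject: no read() or seek()."""


def can_be_opened_by_pillow(obj):
    """Approximate test for what Image.open accepts as its first argument."""
    if isinstance(obj, (str, bytes, os.PathLike)):
        return True  # treated as a filename or path
    return hasattr(obj, "read")  # otherwise it must behave like a file


print(can_be_opened_by_pillow("out.png"))            # a filename
print(can_be_opened_by_pillow(io.BytesIO(b"data")))  # an in-memory file
print(can_be_opened_by_pillow(FakePage()))           # a page-like object
```

This is why the answers above first render the page to an actual image, either on disk or in memory, before passing it to PIL.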