How to read a TEXT in an image in PDF extension file using Python?

Question

I have successfully read a PDF file containing tabular data and text. But I also have an image in PDF format that contains some text, which I need to extract for record-keeping. All the PDFs are in a specific folder. I know only the basics of Python. Could anyone help me with this?

This is a duplicate. Check out this post: https://stackoverflow.com/questions/17630650/simple-python-library-for-recognition-text-from-image — Floam, Nov 20 '19 at 04:23
https://tabula.technology/ could probably solve your problem, using the coordinates of the particular field you are extracting — aayush_malik, Nov 20 '19 at 05:28
Try this one: https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file — Lê Tư Thành, Nov 20 '19 at 08:01
I have tried PyPDF2. It recognizes tabular data and text in PDFs that were converted from MS Word, but I need to read an image that contains some random text. Can anyone help with that? — Prithivi Raj, Nov 20 '19 at 11:53

Maksym Polshcha · Answer 1 · 2019-11-27T16:27:11.987

You can extract both images (inline and XObject) and text (plain and containing PDF operators) from a PDF document using pdfreader.

Here is sample code that extracts all of the above from every page of the document.

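from pdfreader import SimplePDFViewer, PageDoesNotExist

# your_pdf_file_name is a placeholder: set it to the path of your PDF.
fd = open(your_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)

plain_text = ""      # decoded strings only
pdf_markdown = ""    # text together with PDF operators
images = []          # inline images and image XObjects
try:
    while True:
        # Render the current page, then collect its content.
        viewer.render()
        pdf_markdown += viewer.canvas.text_content
        plain_text += "".join(viewer.canvas.strings)
        images.extend(viewer.canvas.inline_images)
        images.extend(viewer.canvas.images.values())
        # Move to the next page; raises PageDoesNotExist after the last one.
        viewer.next()
except PageDoesNotExist:
    pass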

You can also convert the images to PIL/Pillow objects and save them:

for i, img in enumerate(images):
    img.to_Pillow().save("{}.png".format(i))
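
The code above stops at extracting the images; it does not read the text inside them. A minimal sketch of that last step follows. It assumes the third-party pytesseract package and the Tesseract OCR engine are installed, neither of which is part of pdfreader or mentioned above. The helper names ocr_images and list_pdfs are hypothetical, and list_pdfs is only included because the question says the PDFs all sit in one folder.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def ocr_images(images, ocr=None):
    """Run OCR on each extracted image and return one string per image.

    `images` are objects with a to_Pillow() method, such as the ones
    collected from viewer.canvas above. `ocr` is any callable that takes
    a Pillow image and returns a string. If it is omitted, this sketch
    assumes pytesseract is installed and uses its image_to_string.
    """
    if ocr is None:
        import pytesseract  # assumption: third-party, needs Tesseract
        ocr = pytesseract.image_to_string
    return [ocr(img.to_Pillow()).strip() for img in images]
```

With the images list collected above, ocr_images(images) would return one string per image, and list_pdfs("your_folder") gives the files to loop over. OCR quality depends heavily on the scan, so treat the output as a starting point and check it by hand.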
