I have successfully read a PDF file containing tabular data and text. But I also have an image in PDF format that contains some text, which needs to be extracted for record-keeping. All the PDFs are in a specific folder. I know only the basics of Python. Could anyone help me with this?

  • This is a duplicate. Check out this post: https://stackoverflow.com/questions/17630650/simple-python-library-for-recognition-text-from-image – Floam Nov 20 '19 at 04:23
  • https://tabula.technology/ could probably solve your problem, using the coordinates of the particular field you are extracting – aayush_malik Nov 20 '19 at 05:28
  • Try this one: https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file – Lê Tư Thành Nov 20 '19 at 08:01
  • I have tried PyPDF2. It reads tabular data and text in PDFs that were converted from MS Word, but I need to read an image that contains some random text. Can anyone help with that? – Prithivi Raj Nov 20 '19 at 11:53

1 Answer

You can extract both images (inline and XObject) and text (plain and containing PDF operators) from a PDF document using pdfreader.

Here is sample code that extracts all of the above from every page of a document:

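from pdfreader import SimplePDFViewer, PageDoesNotExist

# Placeholder: replace with the path to your own PDF file
your_pdf_file_name = "example.pdf"

fd = open(your_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)

plain_text = ""    # decoded text strings only
pdf_markdown = ""  # text together with PDF operators
images = []        # inline and XObject images from all pages
try:
    while True:
        viewer.render()  # process the current page
        pdf_markdown += viewer.canvas.text_content
        plain_text += "".join(viewer.canvas.strings)
        images.extend(viewer.canvas.inline_images)
        images.extend(viewer.canvas.images.values())
        viewer.next()  # raises PageDoesNotExist after the last page
except PageDoesNotExist:
    pass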
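
Since the question mentions that all the PDFs sit in one folder, the per-file extraction can be wrapped in a loop. The following is only a minimal sketch using the standard library: the helper name `process_pdf_folder` and its `extract` parameter are hypothetical, and `extract` stands for whatever per-file routine you use (for example, the pdfreader loop above wrapped in a function).

```python
from pathlib import Path
from typing import Callable, Dict


def process_pdf_folder(folder: str, extract: Callable[[Path], str]) -> Dict[str, str]:
    """Apply `extract` to every PDF directly inside `folder`.

    `extract` is a hypothetical callback: it receives the path of one
    PDF and returns the text found in it. Returns {file name: text}.
    """
    results = {}
    for path in sorted(Path(folder).iterdir()):
        # Match the extension case-insensitively (.pdf, .PDF, ...)
        if not path.is_file() or path.suffix.lower() != ".pdf":
            continue
        try:
            results[path.name] = extract(path)
        except Exception as exc:
            # One unreadable file should not stop the whole batch
            print(f"Skipping {path.name}: {exc}")
    return results
```

Keeping the extraction step as a callback means the folder loop does not depend on any particular PDF library, so it can be reused if you switch tools later.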

You can also convert the images to PIL/Pillow objects and save them:

for i, img in enumerate(images):
    img.to_Pillow().save("{}.png".format(i))
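
The extracted images are still just pixels, so reading the text inside them needs an OCR step, which the code above does not do. Below is a rough sketch of one way to add it, assuming the third-party `pytesseract` package and the Tesseract engine are installed; neither is mentioned in this answer, and the helper name `ocr_images` is hypothetical. It expects Pillow images, such as those returned by `img.to_Pillow()`.

```python
from typing import Callable, Iterable, List, Optional


def ocr_images(pil_images: Iterable, ocr: Optional[Callable] = None) -> List[str]:
    """Run OCR on each Pillow image and return the recognized strings.

    `ocr` is any callable mapping an image to text. If omitted, this
    sketch assumes pytesseract (plus the Tesseract binary) is installed
    and falls back to pytesseract.image_to_string.
    """
    if ocr is None:
        # Assumption: pytesseract is available; imported lazily so the
        # function can also be used with a different OCR backend.
        import pytesseract
        ocr = pytesseract.image_to_string
    return [ocr(image).strip() for image in pil_images]
```

With the `images` list from above, usage could look like `texts = ocr_images(img.to_Pillow() for img in images)`. Recognition quality depends heavily on the resolution and cleanliness of the scan, so the results may need manual checking.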
Maksym Polshcha