I have successfully read a PDF file that contains tabular data and text. However, I also have an image saved in PDF format that contains some text, which I need to extract for record-keeping. All the PDFs are in a specific folder. I only know the basics of Python. Could anyone help me with this?
- This is a duplicate. Check out this post: https://stackoverflow.com/questions/17630650/simple-python-library-for-recognition-text-from-image – Floam Nov 20 '19 at 04:23
- https://tabula.technology/ could probably solve your problem, using the coordinates of the particular field you are extracting – aayush_malik Nov 20 '19 at 05:28
- Try this one: https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file – Lê Tư Thành Nov 20 '19 at 08:01
- I have tried PyPDF2. It recognizes tabular data and text in PDFs that were converted from MS Word, but I need to read an image that contains some random text. Can anyone help with that? – Prithivi Raj Nov 20 '19 at 11:53
1 Answer
You can extract both images (inline and XObject) and text (plain and containing PDF operators) from a PDF document using pdfreader.
Here is sample code that extracts all of the above from every page of a document:
from pdfreader import SimplePDFViewer, PageDoesNotExist

# your_pdf_file_name is a placeholder: set it to the path of your PDF
fd = open(your_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)

plain_text = ""
pdf_markdown = ""
images = []
try:
    while True:
        # render the current page, then collect its text and images
        viewer.render()
        pdf_markdown += viewer.canvas.text_content
        plain_text += "".join(viewer.canvas.strings)
        images.extend(viewer.canvas.inline_images)
        images.extend(viewer.canvas.images.values())
        viewer.next()
except PageDoesNotExist:
    # raised when stepping past the last page
    pass
You can also convert the images to PIL/Pillow objects and save them:
for i, img in enumerate(images):
    img.to_Pillow().save("{}.png".format(i))
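
The code above extracts the images but does not read the text inside them, and it handles a single file. Below is a rough sketch of how the two missing pieces might be added: looping over every PDF in a folder and running OCR on each extracted image. It assumes the pytesseract package and the Tesseract OCR engine are installed, which the original answer does not mention. The names find_pdfs, ocr_pdf and ocr_folder are made up for illustration, and to_Pillow() may not handle every image encoding found in real PDFs.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return the .pdf files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def ocr_pdf(pdf_path, ocr=None):
    """OCR every image found in one PDF and return the combined text.

    `ocr` is any callable taking a Pillow image and returning a string.
    By default pytesseract.image_to_string is used (an assumption: the
    answer above stops at saving the images).
    """
    # Third-party imports are kept local so the folder helper below
    # can be used without them.
    from pdfreader import SimplePDFViewer, PageDoesNotExist
    if ocr is None:
        import pytesseract
        ocr = pytesseract.image_to_string

    texts = []
    with open(pdf_path, "rb") as fd:
        viewer = SimplePDFViewer(fd)
        try:
            while True:
                viewer.render()
                page_images = list(viewer.canvas.inline_images)
                page_images.extend(viewer.canvas.images.values())
                for img in page_images:
                    texts.append(ocr(img.to_Pillow()))
                viewer.next()
        except PageDoesNotExist:
            pass  # stepped past the last page
    return "\n".join(texts)


def ocr_folder(folder, read_pdf=ocr_pdf):
    """Map each PDF file name in `folder` to the text read from it."""
    return {pdf.name: read_pdf(pdf) for pdf in find_pdfs(folder)}
```

Calling ocr_folder("path/to/your/folder") would then return a dictionary keyed by file name, which you could write to a CSV or text file for your records. OCR quality depends heavily on the scan resolution, so check the results before relying on them.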

Maksym Polshcha