How can I extract text from an image in a pdf using the python port of Apache/Tika 2.6.0?

Asked Jan 31 '23 at 19:13

Active Jan 31 '23 at 19:39

Viewed 117 times

import tika
from tika import parser
import pytesseract
from PIL import Image
import numpy
import scipy
from tika import config

tika.initVM()

# Request headers asking Tika to OCR images embedded in the PDF
headers = {
    'X-Tika-OCRLanguage': 'eng',
    'X-Tika-PDFextractInlineImages': 'true',
    'X-Tika-PDFOcrStrategy': 'ocr_and_text_extraction',
}

parsed_pdf = parser.from_file("Tespdf.pdf", headers=headers)

data = parsed_pdf['content']

# Print the extracted content
print(data)

I added pytesseract, numpy, and scikit-image to preprocess the images. I have successfully extracted text from standalone image files using pytesseract; however, when I embed the same images in a PDF and parse it with Tika, I do not get the text.
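One thing that may be worth ruling out: Tika's OCR runs the `tesseract` executable on the machine where the Tika server runs, and does not go through the pytesseract package imported above. The following is a minimal diagnostic sketch, not a confirmed fix. It assumes the Tika server is started locally by tika-python (the default), and the helper names `build_ocr_headers`, `tesseract_on_path`, `has_text`, and `parse_pdf_with_ocr` are hypothetical, introduced only for illustration.

```python
import shutil


def build_ocr_headers(language="eng"):
    """Return the same X-Tika-* request headers used in the question."""
    return {
        "X-Tika-OCRLanguage": language,
        "X-Tika-PDFextractInlineImages": "true",
        "X-Tika-PDFOcrStrategy": "ocr_and_text_extraction",
    }


def tesseract_on_path():
    """Check whether a `tesseract` executable is visible on this machine's PATH.

    Assumption: the Tika server runs on this same machine. If it runs
    elsewhere, this check has to be made there instead.
    """
    return shutil.which("tesseract") is not None


def has_text(parsed):
    """Return True if a parser result contains non-whitespace content."""
    content = (parsed or {}).get("content")
    return bool(content and content.strip())


def parse_pdf_with_ocr(path, language="eng"):
    """Parse a PDF with the OCR headers; requires tika-python and Java."""
    from tika import parser  # imported lazily so the helpers above work without tika

    return parser.from_file(path, headers=build_ocr_headers(language))
```

A possible way to use it: call `tesseract_on_path()` first, then `parsed = parse_pdf_with_ocr("Tespdf.pdf")`, and check `has_text(parsed)`. If `tesseract_on_path()` returns `False`, the server-side OCR has nothing to invoke, which would be consistent with pytesseract working on its own only if pytesseract was pointed at a Tesseract binary outside the PATH.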

ScottyCov

0 Answers