import tika
from tika import parser
import pytesseract
from PIL import Image
import numpy
import scipy
from tika import config
tika.initVM()
# Ask Tika to OCR inline images as well as extract the regular text layer
headers = {
    'X-Tika-OCRLanguage': 'eng',
    'X-Tika-PDFextractInlineImages': 'true',
    'X-Tika-PDFOcrStrategy': 'ocr_and_text_extraction',
}
parsed_pdf = parser.from_file("Tespdf.pdf", headers=headers)
data = parsed_pdf['content']
# Print the extracted content
print(data)
I added pytesseract, numpy and scikit-image to preprocess the images. I have successfully tested image files using pytesseract; however, if I embed them in a PDF and use Tika, I am not getting the text...
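
To make "not getting the text" more concrete, here is a minimal sketch of a check on the parse result. It assumes only that the result is a dict with `content` and `metadata` keys, as in the code above. The helper name `summarize_tika_result` is hypothetical, and because the metadata key that lists the parsers may differ between Tika versions, it scans all metadata values for the OCR parser's class name instead of relying on one key.

```python
from typing import Any, Dict, Optional


def summarize_tika_result(parsed: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize a dict shaped like the value returned by parser.from_file."""
    parsed = parsed or {}
    # 'content' can be None when nothing was extracted, so fall back to ''.
    content = (parsed.get("content") or "").strip()
    metadata = parsed.get("metadata") or {}

    # Assumption: if the Tesseract parser ran, its class name shows up
    # somewhere in the metadata; the exact key is version dependent.
    ocr_listed = any("TesseractOCRParser" in str(v) for v in metadata.values())

    return {
        "has_text": bool(content),
        "char_count": len(content),
        "ocr_parser_listed": ocr_listed,
    }


# Stand-in result for illustration; in practice pass parsed_pdf from above.
example = {
    "content": "\n\n  \n",
    "metadata": {"X-Parsed-By": ["org.apache.tika.parser.pdf.PDFParser"]},
}
print(summarize_tika_result(example))
```

If `ocr_parser_listed` comes back false for the real PDF, that would suggest the OCR step never ran on the server side, as opposed to running and finding no text.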