import tika
from tika import parser
import pytesseract
from PIL import Image
import numpy
import scipy
from tika import config

tika.initVM()

# Request headers asking Tika to OCR inline images in addition to extracting text
headers = {
    'X-Tika-OCRLanguage': 'eng',
    'X-Tika-PDFextractInlineImages': 'true',
    'X-Tika-PDFOcrStrategy': 'ocr_and_text_extraction',
}

parsed_pdf = parser.from_file("Tespdf.pdf", headers=headers)

# Extracted text of the document
data = parsed_pdf['content']

# Print the content
print(data)

I added pytesseract, numpy and scikit-image to preprocess the images. I have successfully tested image files using pytesseract directly; however, when I embed the same images in a PDF and parse it with Tika, I do not get the text.
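To narrow this down, here is a minimal sketch that parses the same PDF with and without the OCR headers and compares how much text comes back. It assumes `parser.from_file` accepts the `headers` argument exactly as in the code above; `text_length` and `compare_ocr` are hypothetical helper names introduced only for this check.

```python
def text_length(parsed):
    """Count non-whitespace characters in a Tika result dict.

    Tolerates a missing result or a 'content' value of None.
    """
    content = (parsed or {}).get("content") or ""
    return len("".join(content.split()))


def compare_ocr(path, ocr_headers):
    """Parse `path` twice, without and with the OCR headers.

    Returns (plain_length, ocr_length). Hypothetical helper, not part of tika.
    """
    # Imported here so text_length can be used without tika installed.
    from tika import parser

    plain = parser.from_file(path)
    with_ocr = parser.from_file(path, headers=ocr_headers)
    return text_length(plain), text_length(with_ocr)


# Example call, reusing the file name and headers from the code above:
# plain_len, ocr_len = compare_ocr("Tespdf.pdf", headers)
# print(plain_len, ocr_len)
```

If both numbers are the same, the OCR headers probably had no effect on the parse. One possible cause is that the Tika server cannot find a Tesseract installation, which would be separate from whether pytesseract works in Python.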

ScottyCov