I need to extract text from PDF files.
The problem is that some pages are scanned images, so PyPDF and PDFMiner cannot retrieve any text from them and the result is empty.
Could anyone give me a hint on how to process these pages?
I don't think there is a quick solution for handling Unicode text, especially Japanese.
One possible approach is to run OCR on the scanned image with Tesseract:
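import cv2
import pytesseract
from pytesseract import Output

# Load the scanned page as an image (cv2.imread returns None if the file cannot be read)
img = cv2.imread('invoice-sample.jpg')
# Run OCR and get the per-word results (text, confidence, bounding boxes) as a dict
d = pytesseract.image_to_data(img, output_type=Output.DICT)
# Show which fields are available
print(d.keys())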
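The dict returned by image_to_data holds one entry per detected word, so it still has to be assembled into text. Below is a minimal sketch of one way to do that; the helper name words_to_lines and the min_conf threshold of 50 are assumptions made for illustration, not part of pytesseract. It relies on the text, conf, block_num, par_num and line_num fields of the dict.

```python
from typing import Dict, List


def words_to_lines(data: Dict[str, list], min_conf: float = 50.0, sep: str = " ") -> List[str]:
    """Group OCR words into text lines (hypothetical helper, not a pytesseract API).

    `data` is expected to look like the dict returned by
    pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    lines: Dict[tuple, List[str]] = {}
    for i, word in enumerate(data["text"]):
        # Skip empty entries; these describe blocks and lines rather than words.
        if not word.strip():
            continue
        # conf may arrive as a string, int, or float depending on the version.
        if float(data["conf"][i]) < min_conf:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines.setdefault(key, []).append(word)
    # Sort by (block, paragraph, line) so the output follows reading order.
    return [sep.join(words) for _, words in sorted(lines.items())]
```

With the variable from the snippet above, words_to_lines(d) would give a list of text lines. For Japanese, passing sep="" may be preferable, since words are not separated by spaces.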
For more on Tesseract, see this article.
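To apply this to a PDF where only some pages are scanned, one option is to keep the embedded text where it exists and fall back to OCR only for the pages that come back empty. The following is a rough sketch under several assumptions: it uses the pypdf and pdf2image packages (pdf2image needs Poppler installed), the function names and the min_chars threshold are made up for illustration, and for Japanese the Tesseract language data for "jpn" must be installed separately.

```python
from typing import Callable, List, Optional


def needs_ocr(text: Optional[str], min_chars: int = 20) -> bool:
    """Treat a page as scanned if it has almost no embedded text (assumed heuristic)."""
    return len((text or "").strip()) < min_chars


def merge_text_with_ocr(
    page_texts: List[Optional[str]],
    ocr_page: Callable[[int], str],
    min_chars: int = 20,
) -> List[str]:
    """Keep embedded text where present; call ocr_page(index) for the other pages."""
    result = []
    for index, text in enumerate(page_texts):
        if needs_ocr(text, min_chars):
            text = ocr_page(index)
        result.append(text or "")
    return result


def extract_pdf_text(pdf_path: str, lang: str = "eng") -> List[str]:
    """Return one string per page, using OCR only for pages without a text layer."""
    # Imported here so the helpers above can be used without these packages.
    from pypdf import PdfReader
    from pdf2image import convert_from_path
    import pytesseract

    reader = PdfReader(pdf_path)
    page_texts = [page.extract_text() for page in reader.pages]

    def ocr_page(index: int) -> str:
        # pdf2image page numbers are 1-based; render just this one page.
        images = convert_from_path(
            pdf_path, dpi=300, first_page=index + 1, last_page=index + 1
        )
        return pytesseract.image_to_string(images[0], lang=lang)

    return merge_text_with_ocr(page_texts, ocr_page)
```

For example, extract_pdf_text("scan.pdf", lang="jpn") would OCR only the scanned pages of a hypothetical scan.pdf. Rendering at a higher dpi usually helps recognition of small characters, at the cost of speed.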