I need a way to convert a PDF file in which every page is an image into a searchable PDF. A page can contain text, a table, or a combination of both.
I have used ABBYY FineReader Online, which does the job perfectly well, but I am looking for a solution that can be implemented in Python on Windows.
I have done a detailed search, and the links below came close to what I want, but not exactly:
Scanned Image/PDF to Searchable Image/PDF
It suggests using Ghostscript to first convert the PDF to images, which are then converted directly to text. I don't believe Tesseract converts non-searchable PDFs to searchable PDFs.
Converting searchable PDF to a non-searchable PDF
That solution does the reverse, i.e. it converts a searchable PDF to a non-searchable one. Also, I think these solutions are meant for Ubuntu/Linux/macOS.
Can someone please tell me what Python code would convert a non-searchable PDF to a searchable PDF on Windows?
UPDATE 1
I have got the desired result with Asprise Web Ocr. Below is the link and code:
https://asprise.com/royalty-free-library/python-ocr-api-overview.html
I am looking for a solution that uses only Python libraries on Windows, because:
- I don't want to pay subscription costs in the future.
- I need to convert thousands of documents daily, and it would be cumbersome to upload each one to an API, download the result, and so on.
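To make the batch requirement concrete, here is a rough sketch of the local driver I have in mind. It uses only the standard library; the names `plan_batch` and `run_batch` and the `_searchable` suffix are made up for illustration, and `convert` is a placeholder for whatever local OCR step ends up producing the searchable PDF, which is the part I am asking about.

```python
from pathlib import Path
from typing import Callable, Iterator, Tuple


def plan_batch(in_dir: Path, out_dir: Path) -> Iterator[Tuple[Path, Path]]:
    """Yield (source, target) pairs for every PDF directly inside in_dir."""
    for src in sorted(in_dir.iterdir()):
        if src.is_file() and src.suffix.lower() == ".pdf":
            # Hypothetical naming scheme: report.pdf -> report_searchable.pdf
            yield src, out_dir / f"{src.stem}_searchable.pdf"


def run_batch(in_dir: Path, out_dir: Path,
              convert: Callable[[Path, Path], None]) -> int:
    """Apply convert to each PDF that has no output yet; return how many ran."""
    out_dir.mkdir(parents=True, exist_ok=True)
    done = 0
    for src, dst in plan_batch(in_dir, out_dir):
        if dst.exists():  # skip files converted on an earlier run
            continue
        convert(src, dst)  # placeholder: the local OCR step goes here
        done += 1
    return done
```

With something like this, the daily job would be a single call such as `run_batch(Path("scans"), Path("searchable"), my_convert)`, with no upload or download per file.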
UPDATE 2
I know how to convert a non-searchable PDF directly to text, but what I am looking for is a way to convert a non-searchable PDF to a searchable PDF. I have the code for converting the PDF to text using PyPDF2.
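For reference, the PyPDF2 extraction I mean looks roughly like the sketch below. It assumes a PyPDF2 release that provides `PdfReader` and `page.extract_text()` (2.x or later); the helper names are only illustrative. As far as I can tell, a page without a text layer comes back as an empty string, so the same call also shows which pages are still not searchable.

```python
from typing import List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the text layer of each page; pages without one give ''."""
    # Imported here so the helper below can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    return [(page.extract_text() or "").strip() for page in reader.pages]


def pages_needing_ocr(page_texts: List[str]) -> List[int]:
    """Return the 1-based numbers of pages whose extracted text is empty."""
    return [i for i, text in enumerate(page_texts, start=1) if not text]
```

Calling `pages_needing_ocr(extract_page_texts("scan.pdf"))` on a file (the file name is just an example) would list the pages that still have no extractable text.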