Convert Non-Searchable Pdf to Searchable Pdf in Windows Python

Question

I need a solution to convert a PDF file in which every page is an image (a page can contain text, a table, or a combination of both) into a searchable PDF.

I have used ABBYY FineReader Online, which does the job perfectly well, but I am looking for a solution that can be implemented in Python on Windows.

I have done a detailed analysis, and below are the links that came close to what I want, but not exactly:

Scanned Image/PDF to Searchable Image/PDF

It says to use Ghostscript to first convert the PDF to images, and then it converts them directly to text. I don't believe Tesseract converts non-searchable PDFs to searchable PDFs.

Converting searchable PDF to a non-searchable PDF

The above solution does the reverse, i.e. it converts searchable to non-searchable. Also, I think these apply to Ubuntu/Linux/macOS.

Can someone please tell me what the Python code should be for converting a non-searchable PDF to a searchable PDF on Windows?

UPDATE 1

I have got the desired result with Asprise Web Ocr. Below is the link and code:

https://asprise.com/royalty-free-library/python-ocr-api-overview.html

I am looking for a solution that uses only Python libraries on Windows, because:

I do not want to pay subscription costs in the future.
I need to convert thousands of documents daily, and it would be cumbersome to upload each one to an API, download the result, and so on.

UPDATE 2

I know how to convert a non-searchable PDF directly to text. What I am asking is whether there is any way to convert a non-searchable PDF to a searchable PDF. I already have the code for converting the PDF to text using PyPDF2.

score 6 · Answer 1 · answered Sep 05 '18 at 15:12

You don't actually need to transform everything inside the PDF to text. Text will remain text, a table will remain a table, and, where possible, an image should become text. You would need a script that reads the PDF as is and converts it block by block. The script would write out blocks of text until the document has been read completely, and then turn the result into a PDF. Something like:

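for line in pdf_lines:  # pseudocode: pdf_lines and the helpers are placeholders
    if line_is_text():
        write_the_line_as_is()
    elif line_is_img():
        transform_img_in_text()  # see the comments below the code
    # ... and so on for any other kind of block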

I think transform_img_in_text() could be implemented with any of several external libraries. One you could use is:

Tesseract OCR Python

You can install this library via pip; instructions are provided at the link above.
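
On the transform_img_in_text() step: Tesseract can write a PDF with an invisible text layer instead of plain text, and pytesseract exposes this as image_to_pdf_or_hocr(). Below is a minimal sketch, not a tested pipeline. It assumes the pages have already been exported as image files, that Tesseract is installed and on PATH (on Windows you may need to set pytesseract.pytesseract.tesseract_cmd to the full path of tesseract.exe), and that pypdf is installed for merging the pages. The names output_name and images_to_searchable_pdf are made up for illustration.

```python
import io
from pathlib import Path


def output_name(src):
    """Return '<stem>_ocr.pdf' next to the source file (naming is an assumption)."""
    src = Path(src)
    return src.with_name(src.stem + "_ocr.pdf")


def images_to_searchable_pdf(image_paths, out_path, lang="eng"):
    """OCR each page image into a one-page searchable PDF and merge the pages."""
    # Imported lazily so this module still loads where the OCR stack is missing.
    import pytesseract
    from pypdf import PdfReader, PdfWriter

    writer = PdfWriter()
    for image_path in image_paths:
        # Returns the bytes of a PDF: the page image plus a hidden text layer.
        pdf_bytes = pytesseract.image_to_pdf_or_hocr(
            str(image_path), lang=lang, extension="pdf"
        )
        for page in PdfReader(io.BytesIO(pdf_bytes)).pages:
            writer.add_page(page)

    with open(out_path, "wb") as handle:
        writer.write(handle)
    return out_path
```

Usage would look something like images_to_searchable_pdf(["page1.png", "page2.png"], output_name("scan.pdf")), where the page image file names are placeholders.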

Yes, I know this!! Tesseract OCR converts PDF to text, not unsearchable PDF to searchable PDF. Also there are Ghostscript issues when using Python 3 + Tesseract. Trust me, I have tried this!! :) — Rahul Agarwal, Sep 05 '18 at 15:15
So you are looking for an already made solution, not suggestions on how to make one. — Alexandru Martalogu, Sep 06 '18 at 07:49

score 4 · Answer 2 · answered Sep 04 '18 at 15:44

I've used pypdfocr in the past to do this. It hasn't been updated recently though.

From the README:

pypdfocr filename.pdf
--> filename_ocr.pdf will be generated

Read the install instructions for Windows carefully.

A more recent Python library is OCRmyPDF (https://github.com/jbarlow83/OCRmyPDF). There is a Docker image for Windows.
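
For converting many files, the OCRmyPDF command line can be driven from Python. Below is a rough sketch, assuming the ocrmypdf executable is installed and reachable on PATH. The helper names build_ocrmypdf_command and ocr_folder are hypothetical, and the _ocr.pdf suffix simply mirrors the pypdfocr naming above. The --skip-text option tells OCRmyPDF to leave pages that already contain text untouched.

```python
import subprocess
from pathlib import Path


def build_ocrmypdf_command(src, dst, lang="eng"):
    """Build the argument list for: ocrmypdf --skip-text -l <lang> <src> <dst>."""
    return ["ocrmypdf", "--skip-text", "-l", lang, str(src), str(dst)]


def ocr_folder(in_dir, out_dir, lang="eng"):
    """Run OCRmyPDF on every PDF in in_dir; return the inputs that failed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    failed = []
    for src in sorted(Path(in_dir).glob("*.pdf")):
        dst = out_dir / (src.stem + "_ocr.pdf")
        result = subprocess.run(build_ocrmypdf_command(src, dst, lang))
        # A non-zero exit code means OCRmyPDF did not finish this file normally.
        if result.returncode != 0:
            failed.append(src)
    return failed
```

Calling ocr_folder("scans", "searchable") would then process a whole directory in one go; both folder names are placeholders.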

edited Sep 12 '18 at 13:50

answered Sep 04 '18 at 15:44

iacolippo

Already tried it. It is not working, or I am not sure how to make it work on Windows – Rahul Agarwal Sep 04 '18 at 15:48
Updated my answer with another possible solution – iacolippo Sep 12 '18 at 13:50

score 1 · Answer 3 · answered Sep 25 '18 at 22:59

I recently wrote a blog post where I accomplished this using:

OCRmyPDF, a Python library wrapping Tesseract
A Docker container running in Azure

You may need to tweak things to meet your needs, but I believe the building blocks in this post could be applied to your needs:

http://martyice.github.io/docker-in-azure/

Marty

Thanks Marty!! I am running the same on Windows, and there are Docker/Poppler utils for Windows as well, but the process becomes too long: first I have to convert the PDF to images (one image per page), and only then is it converted to a searchable PDF – Rahul Agarwal Sep 26 '18 at 07:26
