OCR library that can insert OCR'd text back into the source PDF

Question

Is there a library (or executable) that can OCR a PDF (typically a PDF created by scanning a paper), and inject the recognized text back into the PDF? Probably as invisible text behind the scanned images.

Preferably open source.

(Goal: I have a huge library of PDF files indexed by Lucene. It would be much easier for Lucene to find which PDFs are relevant if the PDFs contained text.)

Question moved to https://softwarerecs.stackexchange.com/questions/3656/create-searchable-pdf-files-using-ocr-from-scanned-pdfs-in-bulk — Nicolas Raoul, Feb 16 '18 at 13:42

score 0 · Answer 1 · answered Apr 27 '12 at 04:06

One of the best options is probably ABBYY FineReader (www.abbyy.com), which gives you lots of options, including the creation of hidden text. I had a quick look at their site and also came across their Transformer product, which is probably even more suitable for your needs.

http://www.abbyy.com.au/pdftransformer/product_features/

score 0 · Answer 2 · answered Jan 17 '13 at 10:33


If the PDFs don't contain text, what is indexed by Lucene?

Take a look at Docsplit (https://github.com/documentcloud/docsplit); it can use Tesseract to perform OCR. You will get plain text files that reflect the content of the PDFs. You can then build your Lucene index on top of these text files and store a reference to the PDF in each Lucene document. Querying the Lucene index then returns a list of documents with references to the original PDFs.
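As a rough illustration of the "index the OCR text, keep a reference to the PDF" idea, here is a minimal sketch in Python. It is not Lucene and not Docsplit: the tiny inverted index below is only a stand-in, and the names `build_index` and `search` are hypothetical. It assumes the OCR step has already produced one plain-text string per PDF.

```python
import re
from collections import defaultdict


def build_index(text_by_pdf):
    """Map each lowercase word to the set of PDF paths whose OCR text contains it."""
    index = defaultdict(set)
    for pdf_path, text in text_by_pdf.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(pdf_path)
    return index


def search(index, query):
    """Return the PDFs whose OCR text contains every word of the query, sorted by path."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    # Intersect the posting sets so only PDFs matching all words remain.
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)
```

For example, `search(build_index({"a.pdf": "Invoice for March"}), "invoice")` yields the path of the original PDF, which is the reference a real index would store alongside the extracted text.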

answered by maneo

The PDF does not contain text, it is like an image. I have text in another file, and want to inject it into the PDF. If possible, I would like to not touch Lucene configuration. My question is not about Lucene (I cited Lucene to illustrate, but it could be a non-configurable desktop search tool, for instance) – Nicolas Raoul Jan 17 '13 at 11:31
If so, this one seems to be a solution for your problem: [link](http://stackoverflow.com/questions/3335126/itext-add-content-to-existing-pdf-file). iText is one option; you may also take a look at PDFBox. – maneo Jan 17 '13 at 20:52
Yes, a solution would probably involve something like iText/PDFBox indeed! The question you link to makes the text visible, though. I guess there is some good practice for embedding invisible text, using either iText or PDFBox or something else. – Nicolas Raoul Jan 18 '13 at 02:23
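
On the invisible-text point: at the PDF level, text is hidden by setting the text rendering mode to 3 (neither fill nor stroke) with the `Tr` operator. The text then remains selectable and searchable but is not painted. The sketch below only builds the content-stream operators as a string; it does not write a PDF. The function names, the font resource name `F1` and the coordinates are assumptions for illustration, and non-ASCII text would need proper font encoding that is not handled here. Libraries such as iText or PDFBox would normally generate operators like these for you; check their documentation for the call that sets the rendering mode.

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses, which are special in PDF literal strings."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(text, x, y, font_name="F1", font_size=10):
    """Build content-stream operators that place `text` at (x, y) without painting it."""
    return "\n".join([
        "BT",                                # begin text object
        "3 Tr",                              # rendering mode 3: neither fill nor stroke
        f"/{font_name} {font_size} Tf",      # select the (assumed) font resource and size
        f"{x} {y} Td",                       # move to the text position
        f"({escape_pdf_string(text)}) Tj",   # show the string (invisibly)
        "ET",                                # end text object
    ])
```

In an OCR workflow, one such block would be emitted per recognized word or line, positioned using the coordinates reported by the OCR engine, so that selecting text in a viewer lines up with the scanned image.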
