3

Can anyone suggest how to convert a scanned image into a searchable image, or a scanned PDF into a searchable PDF?
I have been stuck on this for quite a while now.
I tried the pdfocr application on Ubuntu, but without success.

sunny

2 Answers

4

Tesseract 3.03 and later can create a searchable PDF from an image. If your input is a PDF, use Ghostscript to convert it to images first, then pass those images to Tesseract.

https://github.com/tesseract-ocr/tesseract
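A minimal sketch of that two-step pipeline, assuming the `gs` and `tesseract` executables are installed and on your PATH. The function names, the 300 dpi default, the `png16m` device choice and the `page-%03d.png` naming are my own choices for illustration, not something Tesseract requires.

```python
import subprocess
from pathlib import Path


def ghostscript_cmd(pdf_path, out_pattern, dpi=300):
    """Build a Ghostscript command that renders each PDF page to a PNG.

    out_pattern should contain a printf-style counter such as %03d so that
    every page gets its own file.
    """
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m",            # 24-bit colour PNG output
        f"-r{dpi}",                   # render resolution in dpi
        f"-sOutputFile={out_pattern}",
        str(pdf_path),
    ]


def tesseract_cmd(image_path, output_base, lang="eng"):
    """Build a Tesseract command.

    The trailing "pdf" config makes Tesseract write <output_base>.pdf
    with an invisible text layer over the page image.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def pdf_to_searchable_pages(pdf_path, work_dir):
    """Rasterise a scanned PDF, then OCR each page into its own searchable PDF."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_cmd(pdf_path, work_dir / "page-%03d.png"), check=True)

    results = []
    for image in sorted(work_dir.glob("page-*.png")):
        base = image.with_suffix("")  # e.g. work_dir/page-001
        subprocess.run(tesseract_cmd(image, base), check=True)
        results.append(base.with_suffix(".pdf"))
    return results


if __name__ == "__main__":
    # "scan.pdf" is a placeholder input file name.
    for page_pdf in pdf_to_searchable_pages("scan.pdf", "ocr_out"):
        print(page_pdf)
```

This produces one searchable PDF per page; joining them back into a single file is a separate step that the sketch leaves out.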

nguyenq
  • I thought Tesseract just gives you a String of the text OCR'd from the PDF/image. Will it take a scanned-image PDF and turn it into a searchable-text PDF? (without altering the format/text too much!) ? – Don Cheadle Oct 21 '14 at 20:48
  • 1
    As said, you'll need to convert PDF to image first; Tesseract does not read PDF natively. – nguyenq Oct 22 '14 at 01:42
  • 1
    How to write the data retrieved by image_to_pdf_or_hocr to a pdf file? – Sidath Asiri Apr 25 '19 at 03:50
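On the last comment: pytesseract's `image_to_pdf_or_hocr` returns the PDF as bytes, so it can be written to a file opened in binary mode. A small sketch, assuming pytesseract and the Tesseract binary are installed; the helper names and file names here are made up for the example.

```python
from pathlib import Path


def save_pdf_bytes(pdf_bytes, out_path):
    """Write raw PDF bytes to disk in binary mode and return the path."""
    out_path = Path(out_path)
    out_path.write_bytes(pdf_bytes)
    return out_path


def image_to_searchable_pdf(image_path, out_path):
    """OCR one image and save the result as a searchable PDF."""
    # Imported here so the module still loads when pytesseract is absent.
    import pytesseract

    pdf_bytes = pytesseract.image_to_pdf_or_hocr(str(image_path), extension="pdf")
    return save_pdf_bytes(pdf_bytes, out_path)


if __name__ == "__main__":
    # "page.png" and "page.pdf" are placeholder file names.
    print(image_to_searchable_pdf("page.png", "page.pdf"))
```

The key point is the binary write; writing the bytes through a text-mode file handle would fail or corrupt the PDF.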
1

Currently, there is no good way of doing this on Ubuntu. The OCR engines output plain text, and there is no way to add that text as a hidden layer on the PDF, positioned over the text in the scanned image.

Option 1: Use gscan2pdf, which produces a searchable PDF, but the OCRed text is invisible, far too small, and placed in the top-left corner of the page instead of over the matching words.

Option 2: Use PDF-XChange Viewer, which has an OCR option and works correctly: it adds a text layer over the scanned image that lines up with it. It is a Windows application, so you'll have to run it under Wine.

Cornelius
  • Thanks Cornelius for the options. Is PDF-XChange Viewer free to use for commercial purposes? – sunny Jul 20 '14 at 09:30
  • @user2722127 yes: "The FREE PDF viewer download of the PDF-XChange Viewer may be used without limitation for Private, Commercial, Government and all uses, provided it is not -: incorporated or distributed for profit/commercial gain with other software or media distribution of any type - without first gaining permission." – Cornelius Jul 20 '14 at 09:33