I need to develop a system that turns an image into a searchable PDF. Since this is school work, I need something open source. After much research I found tessnet2 (Tesseract), and I can extract the text from an image in TIFF format. But how do I convert this information into a PDF? Note: I need to keep the file structure.

I need some direction to continue my research. Can someone help me, please?

Thank you.

msantiago
  • I guess that to do this you would need an OCR library that does the job for you. It is a little too complicated to discuss on a Q&A site. – Shakti Prakash Singh Nov 29 '13 at 13:22
  • Shakti, what do you suggest? – msantiago Nov 29 '13 at 13:24
  • I suggest using [link](http://www.codeproject.com/Articles/196168/Contour-Analysis-for-Image-Recognition-in-C), as I do myself for this type of work. The code can be trained to recognize new contours, both from scans and from fonts. I use it myself for license plate detection. – online Thomas Nov 29 '13 at 13:58
  • user2754599 - As I understand it, this would help me detect the text, great! But how do I convert it to a searchable PDF? – msantiago Nov 29 '13 at 14:43

1 Answer

There are a couple of .NET hOCR-to-PDF libraries that you may want to check out on the Tesseract 3rdParty page.
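
To make the hOCR-to-PDF idea more concrete, here is a minimal sketch of what such a library has to do with its input. It is written in Python with only the standard library, purely as an illustration and not as the API of any of the .NET libraries mentioned. hOCR is HTML in which each recognized word carries a `title="bbox x0 y0 x1 y1"` attribute in image pixels; a converter reads those boxes and places invisible text at the matching position on top of the page image. The names `HocrWordParser`, `parse_hocr_words`, `to_pdf_points` and the sample markup are hypothetical, and the 300 DPI page size is an assumption.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from hOCR word spans (class 'ocrx_word')."""

    def __init__(self):
        super().__init__()
        self.words = []     # list of (text, (x0, y0, x1, y1)) in image pixels
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or ""):
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def parse_hocr_words(hocr):
    """Return every recognized word with its pixel bounding box."""
    parser = HocrWordParser()
    parser.feed(hocr)
    return parser.words


def to_pdf_points(bbox, page_height_px, dpi):
    """Convert a pixel bbox (origin top-left) to PDF points (origin bottom-left)."""
    x0, y0, x1, y1 = bbox
    return (
        x0 * 72.0 / dpi,
        (page_height_px - y1) * 72.0 / dpi,
        x1 * 72.0 / dpi,
        (page_height_px - y0) * 72.0 / dpi,
    )


# Hypothetical hOCR fragment for a 2550x3300 px scan (letter size at an assumed 300 DPI).
SAMPLE_HOCR = """<div class='ocr_page' title='bbox 0 0 2550 3300'>
<span class='ocr_line' title='bbox 300 300 900 360'>
<span class='ocrx_word' title='bbox 300 300 600 360; x_wconf 95'>Hello</span>
<span class='ocrx_word' title='bbox 630 300 900 360; x_wconf 93'>world</span>
</span></div>"""

for text, bbox in parse_hocr_words(SAMPLE_HOCR):
    print(text, to_pdf_points(bbox, page_height_px=3300, dpi=300))
```

A real converter would then draw the scanned image as the page background and write each word at these coordinates in an invisible text rendering mode, which is what makes the resulting PDF searchable while it still looks like the original scan.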

Adam Plocher
nguyenq
  • This is already very useful. Do you have any examples of how to apply it on Windows? – msantiago Nov 29 '13 at 18:26
  • The [hOcr2Pdf.NET](http://hocrtopdf.codeplex.com/documentation) site has some code examples. You can use the [Tesseract 3.x .NET wrapper](https://github.com/charlesw/tesseract) to output hOCR strings to be used as input to the library. – nguyenq Nov 30 '13 at 00:22