What I'm looking for is a C# solution to import data from PDF documents into our database, in a commercial application. Our customers will want to import arbitrary documents. Ordinarily I'd write this off as a complete impossibility, but the documents they're importing will each follow the customer's own fixed layout.

My plan is to have the PDFs rendered to static images, then allow the users to set up their own templates, which essentially pull out text at predefined pixel-offsets in the PDF, using OCR. For tables, they define a location of the table and a bunch of further values for column and row sizes. We can then apply the template onto that document type.
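
To make the template idea concrete, here is a minimal sketch of how such a template could be modelled. All type and member names (`PixelRect`, `FieldRegion`, `TableRegion`, `Cells`) are hypothetical and not taken from any library, and it assumes C# 10 or later for `record struct`. A table is described by its top-left corner, a fixed row height, a row count and a list of column widths, and each cell rectangle is derived from those values so it can be handed to an OCR step.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical template types for illustration; not part of any library.
// A rectangle in pixel coordinates of the rendered page image.
public readonly record struct PixelRect(int X, int Y, int Width, int Height);

// A single named value located at a fixed position on the page.
public sealed record FieldRegion(string Name, PixelRect Bounds);

// A table described by its origin, a fixed row height and per-column widths.
public sealed record TableRegion(
    string Name, int X, int Y, int RowHeight, int RowCount,
    IReadOnlyList<int> ColumnWidths)
{
    // Yields the rectangle of every cell, row by row, left to right.
    public IEnumerable<(int Row, int Column, PixelRect Bounds)> Cells()
    {
        for (int row = 0; row < RowCount; row++)
        {
            int left = X;
            for (int col = 0; col < ColumnWidths.Count; col++)
            {
                yield return (row, col,
                    new PixelRect(left, Y + row * RowHeight, ColumnWidths[col], RowHeight));
                left += ColumnWidths[col];
            }
        }
    }
}
```

Each rectangle produced by `Cells()` (or stored in a `FieldRegion`) would then be the region passed to whichever OCR library is chosen. Variable row heights or tables that span pages would need a richer model than this.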

So, what I'm really looking for is two libraries: one to convert PDFs to images, another to OCR those images.

Requirements:

  • Is pure-C# or has a supported C# wrapper onto a native DLL.
  • Doesn't fork out processes - wrappers that essentially just create command line parameters and launch an external executable aren't allowed in this case.
  • In the case of FOSS, allows us to exempt ourselves from normal FOSS license requirements (i.e. publishing our source code) by paying a license fee.

We certainly don't mind paying for a commercial solution, but we'd rather not get stuck with paying a fee per individual distribution of the software.

I know this is quite a specific requirement set - perhaps enough for some people to deem this question too localised, but I'm hoping that someone can suggest an approach and some libraries that can be helpful to me, as well as others in the future.

Stuff I've looked into for the PDF side:

  • iTextSharp - Documentation is a book you have to buy, not a good start. Doesn't seem to be much useful documentation regarding turning PDFs into images in the public domain. Licensing is opaque, looks like we have to pay per client we distribute to.
  • Docotic.Pdf - Text only, no use to us.
  • pdftohtml - Again, doesn't produce images. Would be a mess to port to C# too.
  • PdfFileParser - Still not what we need.
  • GhostScript - Pretty much exactly what we want, but requires forking out to a program.

For the OCR side, I'll probably end up using Tesseract, since the Apache license is permissive and it's got good reviews. If there's an alternative, I'd be interested in that too.

Oded
Polynomial
  • With PDF IFilter you can read PDF data and put it in a database. Example Foxit provides an IFilter component to read PDF docs. – robertpnl May 31 '12 at 10:45
  • The `iTextSharp` license is the GNU Affero General Public License (AGPL). – Oded May 31 '12 at 10:46
  • @Robert-PaulHoving Isn't that a solution for PDFs that actually include text, though? These PDFs may just be a wrapper for a giant scanned image. I also need to be able to grab stuff at specific locations (pixel offsets) - does IFilter support that? – Polynomial May 31 '12 at 10:52
  • @Oded If I'm reading it correctly, the license requires that we distribute the source of our application when using AGPL. They have a release license, but it doesn't really do what we need anyway. – Polynomial May 31 '12 at 10:56
  • @Polynomial IFilter searches for text only, so scanned images containing text will not be indexed. As for pixel offsets, I don't know. It may depend on the type of IFilter. – robertpnl May 31 '12 at 10:59
  • @Robert-PaulHoving Yeah, that's the problem. We can't rely on them actually being text. Hence why I said I'd like to render them to images (e.g. JPEG) and OCR them. – Polynomial May 31 '12 at 11:01

2 Answers

I would like to recommend Amyuni PDF Creator .Net for this task.

1st Scenario:
If your PDF files are well defined (no missing font information, etc.), you can extract the text directly from the PDF by passing a rectangular region to the method GetObjectsInRectangle. You should also use the option acGetRectObjectsOptimize:

Optimize text objects before returning them. That is, combine text objects that are close to each other into a single text object.

2nd Scenario:
If there are images involved that also contain text, rendering the whole page into an image and then applying OCR might be a better choice. You can do this with Amyuni PDF Creator .Net by using the methods ExportToTiff, ExportToJPeg, or RasterizePageRange.

From the documentation:

IacDocument.RasterizePageRange Method
The RasterizePageRange method converts page contents into a color or grey scale image. When archiving documents or performing OCR, it is sometimes preferable for all pages to be stored as images rather than complex text and graphic operations.

Then you can use our OCR add-in, which integrates with Tesseract OCR. To apply OCR to your files, use the method OCRPageRange. Once the pages have been OCRed, you are back in the 1st Scenario and can extract the text with GetObjectsInRectangle.

void OCRPageRange(int startPage, int EndPage, string Language, acOCROptions Options)

About licensing, Amyuni PDF Creator .Net provides a (per application) royalty free license.

Usual disclaimer applies

yms

I think you might want to give Docotic.Pdf another chance.

The library can extract text chunks, words and even individual characters with their bounding rectangles. Please have a look at the sample for extraction of words from PDFs.

Also, Docotic.Pdf can create images from PDFs and draw pages on a System.Drawing.Graphics. Please have a look at Draw and print Pdf group of samples.

Disclaimer: I am one of the developers of the library.

Bobrovsky
  • I didn't realise Docotic.Pdf had that functionality. Investigating now. If it works well, you may well have made yourself a sale! :) – Polynomial Jun 01 '12 at 10:10
  • I've played around with it, and the results look promising. However, the output images created when drawing the pages are poor resolution and barely legible. Is this a known issue, or am I doing something wrong? – Polynomial Jun 01 '12 at 12:01
  • Ignore previous, I just needed to zoom! – Polynomial Jun 01 '12 at 12:06
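
The "zoom" fix in the last comment matches how PDF rendering generally behaves: PDF page sizes are expressed in points (72 per inch by default), so a page drawn at a scale of 1 comes out at roughly 72 dpi, which is usually too coarse for OCR. The sketch below shows the arithmetic only. `PageScale` and its members are hypothetical helper names, not part of Docotic.Pdf or any other library, and the 300 dpi target is an assumption based on common OCR guidance rather than anything stated in this thread.

```csharp
using System;

// Hypothetical helper for illustration; not part of any PDF library.
public static class PageScale
{
    // Default PDF user space: 72 units ("points") per inch.
    public const double PointsPerInch = 72.0;

    // Converts a length in PDF points to pixels at the given resolution.
    public static int PointsToPixels(double points, double dpi) =>
        (int)Math.Round(points * dpi / PointsPerInch);

    // Scale factor to apply when rendering so the output has the given dpi.
    public static double ZoomForDpi(double dpi) => dpi / PointsPerInch;
}
```

For example, a US Letter page is 612 x 792 points, so `PageScale.PointsToPixels(612, 300)` gives a width of 2550 pixels, and the matching render scale is `PageScale.ZoomForDpi(300)`, about 4.17. Template offsets stored in pixels are only valid for the dpi they were captured at, so it is worth storing the dpi alongside each template.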