What I'm looking for is a C# solution to import data from PDF documents into our database, in a commercial application. Our customers will want to import arbitrary documents. Ordinarily I'd write this off as a complete impossibility, but the documents each customer imports will be in their own fixed layout.
My plan is to have the PDFs rendered to static images, then allow the users to set up their own templates, which essentially pull out text at predefined pixel-offsets in the PDF, using OCR. For tables, they define a location of the table and a bunch of further values for column and row sizes. We can then apply the template onto that document type.
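To make the template idea concrete, here's a rough sketch of the kind of data model I have in mind. All of the type and member names (`PixelRect`, `FieldRegion`, `TableRegion`, `Cells`) are made up for illustration and don't come from any library. It assumes, for simplicity, that every row in a table has the same height. Each rectangle it produces would then be cropped from the rendered page image and passed to the OCR engine.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types for illustration only; none of these come from a library.

// A rectangle in pixel coordinates on the rendered page image.
public readonly struct PixelRect
{
    public int X { get; }
    public int Y { get; }
    public int Width { get; }
    public int Height { get; }

    public PixelRect(int x, int y, int width, int height)
    {
        X = x;
        Y = y;
        Width = width;
        Height = height;
    }

    public override string ToString() => $"({X},{Y}) {Width}x{Height}";
}

// A single named value read from a fixed region of the page.
public sealed class FieldRegion
{
    public string Name { get; }
    public PixelRect Bounds { get; }

    public FieldRegion(string name, PixelRect bounds)
    {
        Name = name;
        Bounds = bounds;
    }
}

// A table anchored at a pixel offset, with fixed column widths
// and (as a simplifying assumption) one shared row height.
public sealed class TableRegion
{
    public int Left { get; }
    public int Top { get; }
    public IReadOnlyList<int> ColumnWidths { get; }
    public int RowHeight { get; }
    public int RowCount { get; }

    public TableRegion(int left, int top, IReadOnlyList<int> columnWidths,
                       int rowHeight, int rowCount)
    {
        Left = left;
        Top = top;
        ColumnWidths = columnWidths;
        RowHeight = rowHeight;
        RowCount = rowCount;
    }

    // Yields one rectangle per cell, left to right, then top to bottom.
    public IEnumerable<(int Row, int Column, PixelRect Bounds)> Cells()
    {
        for (int row = 0; row < RowCount; row++)
        {
            int x = Left;
            for (int col = 0; col < ColumnWidths.Count; col++)
            {
                int y = Top + row * RowHeight;
                yield return (row, col, new PixelRect(x, y, ColumnWidths[col], RowHeight));
                x += ColumnWidths[col];
            }
        }
    }
}
```

One caveat with this approach: pixel offsets are only meaningful if every page is rendered at the same resolution the template was defined at, so the template would probably need to record the DPI as well.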
So, what I'm really looking for is two libraries: one to convert PDFs to images, another to OCR those images.
Requirements:
- Is pure-C# or has a supported C# wrapper onto a native DLL.
- Doesn't fork out processes - wrappers that essentially just create command line parameters and launch an external executable aren't allowed in this case.
- In the case of FOSS, allows us to exempt ourselves from normal FOSS license requirements (i.e. publishing our source code) by paying a license fee.
We certainly don't mind paying for a commercial solution, but we'd rather not get stuck with paying a fee per individual distribution of the software.
I know this is quite a specific requirement set - perhaps enough for some people to deem this question too localised, but I'm hoping that someone can suggest an approach and some libraries that can be helpful to me, as well as others in the future.
Stuff I've looked into for the PDF side:
- iTextSharp - Documentation is a book you have to buy, not a good start. There doesn't seem to be much useful publicly available documentation on turning PDFs into images. Licensing is opaque, looks like we have to pay per client we distribute to.
- Docotic.Pdf - Text only, no use to us.
- pdftohtml - Again, doesn't produce images. Would be a mess to port to C# too.
- PdfFileParser - Still not what we need.
- Ghostscript - Pretty much exactly what we want, but requires forking out to a program.
For the OCR side, I'll probably end up using Tesseract, since the Apache license is permissive and it's got good reviews. If there's an alternative, I'd be interested in that too.