PDF to HTML and OCR solution for information extraction

Question

I'm looking for a PDF-to-HTML and OCR solution, either as a cloud service or as an SDK. My searches turned up a bunch of services on the internet; I tried some of them and got a rough idea of what they offer. I'd like to know whether any of you use such a service.

My biggest concern is having an automated pipeline that produces HTML output I can use for information extraction. I'd like structured output, such as tables. Most of the services produce HTML either in a per-character format (a CSS/HTML tag for each character) or in a per-paragraph format (CSS/HTML for each line).

So far I have checked:

Abbyy Cloud SDK: They don't have a PDF-to-HTML service, but they have PDF-to-XML, which may be convertible to HTML with XSLT. Their OCR service with text output is also quite good.
cloudconvert.org: They give the same results as the Ubuntu pdftohtml command, which is based on poppler-Xpdf3.0.
pdftohtml command (tested on Ubuntu): I got a result full of <p> tags.
aspose.PDF: They don't have a PDF-to-HTML service in the cloud, but they have good integration with GDrive, Dropbox and Amazon S3.
PdfNET of PDFTron: I got a result with a complex CSS and HTML structure, with almost one tag per character.
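
One possible workaround for the flat <p> output, offered only as a rough sketch: pdftohtml also has an -xml mode that emits each text fragment with its top/left coordinates, and those can be grouped into rows to approximate a table. The sample XML, the rows_from_xml helper and the 3-pixel tolerance below are all hypothetical choices for illustration, not something any of the services above provide.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the style of `pdftohtml -xml` output.
# Real files carry more attributes, font specs and a DOCTYPE.
SAMPLE_XML = """<pdf2xml>
<page number="1" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="40" height="12" font="0">Item</text>
<text top="101" left="200" width="40" height="12" font="0">Price</text>
<text top="120" left="50" width="40" height="12" font="0">Apple</text>
<text top="120" left="200" width="40" height="12" font="0">1.20</text>
</page>
</pdf2xml>"""


def rows_from_xml(xml_text, tolerance=3):
    """Group <text> fragments into rows by vertical position.

    Returns one list of rows per page; each row is a list of cell
    strings ordered left to right. `tolerance` (an assumed value) is
    the maximum difference in `top` for fragments to share a row.
    """
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("page"):
        fragments = []
        for node in page.iter("text"):
            # itertext() also picks up text nested in <b>/<i> children.
            content = "".join(node.itertext()).strip()
            if content:
                fragments.append(
                    (int(node.get("top")), int(node.get("left")), content)
                )
        fragments.sort()  # top to bottom, then left to right

        rows = []  # each entry: [top of first fragment, [(left, text), ...]]
        for top, left, content in fragments:
            if rows and abs(top - rows[-1][0]) <= tolerance:
                rows[-1][1].append((left, content))
            else:
                rows.append([top, [(left, content)]])

        pages.append([[text for _, text in sorted(cells)] for _, cells in rows])
    return pages


table = rows_from_xml(SAMPLE_XML)[0]
for row in table:
    print(row)
```

This only recovers rows; finding column boundaries (for example by clustering the left values) would need more work, and multi-line cells would break the simple top-based grouping.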

My question is whether you know of any other service worth trying that produces structured HTML output for data extraction.

Thanks in advance.

Have you looked at Tesseract's hOCR output? http://stackoverflow.com/questions/15829148/does-tesseracts-hocr-output-really-contain-bounding-boxes-and-confidence-levels — nguyenq, Sep 20 '13 at 22:58
Our cloud service is currently free to try (http://www.idrsolutions.com/cloud-conversion/). Most PDFs do not contain any structure so there is nothing to pass on - we focus on getting it to look correct. — mark stephens, Sep 21 '13 at 06:53
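
To illustrate the hOCR suggestion in the comments: hOCR is HTML in which each recognized word is a span whose title attribute holds a bounding box and, in Tesseract's output, a word confidence (x_wconf). A minimal sketch of pulling those out with the Python standard library follows. The sample markup and the HocrWords class are hypothetical and simplified; real hOCR files are larger and may nest extra tags inside a word.

```python
import re
from html.parser import HTMLParser

# Hypothetical, simplified hOCR fragment for illustration only.
SAMPLE_HOCR = """<div class='ocr_page' title='bbox 0 0 600 800'>
<span class='ocr_line' title='bbox 36 92 300 116'>
<span class='ocrx_word' id='word_1' title='bbox 36 92 96 116; x_wconf 94'>Total</span>
<span class='ocrx_word' id='word_2' title='bbox 110 92 180 116; x_wconf 88'>42.00</span>
</span></div>"""


class HocrWords(HTMLParser):
    """Collect text, bounding box and confidence for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None outside a word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            title = attrs.get("title") or ""
            bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            conf = re.search(r"x_wconf (\d+(?:\.\d+)?)", title)
            self._current = {
                "bbox": tuple(int(v) for v in bbox.groups()) if bbox else None,
                "conf": float(conf.group(1)) if conf else None,
                "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Assumes no nested <span> inside a word span.
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


parser = HocrWords()
parser.feed(SAMPLE_HOCR)
words = parser.words
for word in words:
    print(word["text"], word["bbox"], word["conf"])
```

With the bounding boxes available, words can be grouped into rows and columns by position, which is closer to the structured output asked for in the question than per-character HTML.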
