
I'm looking for a PDF-to-HTML and OCR solution, either as a cloud service or as an SDK. My searches turned up a bunch of services, and I have tried some of them to get a feel for what they offer. I'd like to know whether any of you use such a service.

My main concern is an automated pipeline that produces HTML output I can use for information extraction. I'd like structured output, such as tables. Most of the services provide HTML in either a per-character format (a CSS/HTML tag for each character) or a per-paragraph format (a CSS/HTML tag for each line).
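As a rough illustration of why real table markup matters for extraction, here is a minimal sketch that pulls rows and cells out of HTML that contains `<table>` elements, using only Python's standard library. The class name `TableExtractor`, the helper `extract_tables`, and the sample markup are hypothetical, and nested tables are not handled. Per-character or per-line output has no `<table>` elements, so the same code returns nothing for it.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._table = None  # rows of the table currently being read
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def _close_cell(self):
        # Store the open cell (if any) with its whitespace normalized.
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        # Store the open row (if any), closing a dangling cell first.
        self._close_cell()
        if self._row is not None and self._table is not None:
            self._table.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._close_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td> or </th>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()
        elif tag == "table" and self._table is not None:
            self._close_row()
            self.tables.append(self._table)
            self._table = None


def extract_tables(html):
    """Return all tables found in the given HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


if __name__ == "__main__":
    # Hypothetical converter output that does contain table markup.
    sample = (
        "<table><tr><th>Item</th><th>Price</th></tr>"
        "<tr><td>Widget</td><td> 9.99 </td></tr></table>"
    )
    for table in extract_tables(sample):
        for row in table:
            print(row)
```

With output like this, each table becomes a plain list of rows that can go straight into further processing, which is not possible when every character or line is wrapped in its own positioned tag.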

So far I have checked:

  • ABBYY Cloud SDK (They don't have a PDF-to-HTML service, but they do have PDF-to-XML, which might be convertible to HTML with XSLT. Their OCR service with text output is also quite good.)
  • cloudconvert.org (They produce the same results as the Ubuntu pdftohtml command, which is based on poppler-Xpdf3.0.)
  • pdftohtml command (tested on Ubuntu): I got a result full of <p> tags.
  • Aspose.PDF (They don't have a PDF-to-HTML service in the cloud, but they have good integration with Google Drive, Dropbox and Amazon S3.)
  • PDFNet by PDFTron: I got a result with a complex CSS and HTML structure, with almost one tag per character.

My question is whether you know of any other service worth trying that produces structured HTML output suitable for data extraction.

Thanks in advance.

zafatar
  • Have you looked at Tesseract's hOCR output? http://stackoverflow.com/questions/15829148/does-tesseracts-hocr-output-really-contain-bounding-boxes-and-confidence-levels – nguyenq Sep 20 '13 at 22:58
  • Our cloud service is currently free to try (http://www.idrsolutions.com/cloud-conversion/). Most PDFs do not contain any structure so there is nothing to pass on - we focus on getting it to look correct. – mark stephens Sep 21 '13 at 06:53
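For reference, hOCR output of the kind mentioned in the first comment can be read with a short standard-library script. This is only a sketch: it assumes the common hOCR convention in which each word is a `span` with class `ocrx_word` whose `title` attribute carries `bbox x0 y0 x1 y1` and, optionally, `x_wconf`. The names `HocrWordParser` and `parse_hocr_words` and the sample markup are hypothetical.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
CONF_RE = re.compile(r"x_wconf (\d+(?:\.\d+)?)")


class HocrWordParser(HTMLParser):
    """Collect text, bounding box and confidence for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word currently being read, or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            title = attrs.get("title") or ""
            bbox = BBOX_RE.search(title)
            conf = CONF_RE.search(title)
            self._current = {
                "text": "",
                "bbox": tuple(int(v) for v in bbox.groups()) if bbox else None,
                "conf": float(conf.group(1)) if conf else None,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # The first closing </span> after a word span ends that word.
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


def parse_hocr_words(hocr):
    """Return a list of word dicts found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


if __name__ == "__main__":
    # Hypothetical hOCR fragment for illustration.
    sample = (
        "<span class='ocr_line' title='bbox 36 92 580 122'>"
        "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span> "
        "<span class='ocrx_word' title='bbox 109 92 200 116; x_wconf 88'>world</span>"
        "</span>"
    )
    for word in parse_hocr_words(sample):
        print(word)
```

The bounding boxes give word positions on the page, which is the raw material for grouping words into rows and columns. That grouping step is not shown here.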
