26

Is there a .dll that takes a PDF file as input and produces an HTML file as output? I want to convert PDF to HTML. My colleague says that doing it step by step (extracting text, fonts, images, margins, links, etc. from the PDF and then building a new HTML file with the same content) is very difficult, nearly impossible. So I was wondering whether there is a DLL I can reference to do this.

wonea
petko_stankoski
  • It's complicated for sure, but why do you want it? – Thanh Nguyen Nov 14 '11 at 15:28
  • There are several HTML-to-PDF converter tools offered by vendors, but I haven't seen any PDF-to-HTML ones. I don't know whether the full version of Acrobat can export to HTML, so you should check that first and see the results. Then maybe you can set up some batch jobs that use Acrobat to do it. Just an idea... – YvesR Nov 14 '11 at 15:28
  • 1
    A web search for "convert pdf to html" will gather many possible solutions. SO is not a good place for product suggestions, therefore voting to close as "not constructive". – Richard Nov 14 '11 at 15:29
  • Copy the PDF contents into Word, then save as HTML. – Dustin Davis Nov 14 '11 at 15:32
  • See this post for a basic start on text extraction using iTextSharp http://stackoverflow.com/questions/6882098/how-can-i-get-text-formatting-with-itextsharp – Chris Haas Nov 14 '11 at 19:15
  • 12
    These close-fanatics are going to destroy SO... I would understand this question being closed as a duplicate, since it has been asked a few times, but not constructive? Really??? There are thousands of questions like this one (and worse) on SO that have been considered valid. Are you now going to close all requests for libraries that solve problem X? – yms Nov 14 '11 at 21:18
  • Here is a very old duplicate of this question: http://stackoverflow.com/questions/1638937/how-can-i-convert-pdf-to-html – yms Nov 16 '11 at 02:43

3 Answers

12

Writing a program to do it is definitely not trivial. If you can't find a .NET library that does this (I couldn't, at least not a free one), I would just download this and invoke it programmatically to get the HTML.

If you have the time to spare and/or PDFToHtml does not produce acceptable output for you, you could use iText to write the program yourself. It's a very mature free pdf library. I've used it in the past to manipulate PDFs (merge, create, etc).

UPDATE

As noted in the comment by Quandary, the PDFSharp library offers a more relaxed license (MIT) compared to the commercial or AGPL license offered by iText. Keep this in mind when choosing your library. I have not used the PDFSharp library myself and I don't know how they compare in terms of functionality.

Icarus
  • 1
    If anybody does this, better use pdfsharp, it has the better license. – Stefan Steiger Apr 28 '14 at 10:37
  • 12
    On the PDFSharp FAQ they state that their library doesn't convert PDF to HTML and they have no plans to support it. http://www.pdfsharp.net/wiki/pdfsharpfaq.ashx#Can_I_use_PDFsharp_to_convert_PDF_to_Word_RTF_HTML_11 – The Muffin Man Oct 07 '15 at 16:46
8

You can download this free tool: PDFToHTML

Then in your program just fork a new process and run the executable passing the PDF file. I just tested it now and it seems to work ok.
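As a rough illustration of the "run the executable as a child process" approach, here is a minimal Python sketch. The executable name and the `<input.pdf> <output.html>` argument order are assumptions made for illustration only; check the usage text of the tool you actually downloaded. The helper names `build_command` and `convert_pdf_to_html` are hypothetical.

```python
from __future__ import annotations

import subprocess


def build_command(executable: str, pdf_path: str, html_path: str) -> list[str]:
    """Build the argument list for the converter process.

    Assumption: the tool is invoked as `<executable> <input.pdf> <output.html>`.
    Adjust the order and add options according to your tool's documentation.
    """
    return [executable, pdf_path, html_path]


def convert_pdf_to_html(
    executable: str,
    pdf_path: str,
    html_path: str,
    timeout_seconds: float = 60.0,
) -> subprocess.CompletedProcess:
    """Run the external converter and wait for it to finish.

    Raises FileNotFoundError if the executable cannot be found,
    subprocess.CalledProcessError if it exits with a non-zero code,
    and subprocess.TimeoutExpired if it runs longer than the timeout.
    """
    command = build_command(executable, pdf_path, html_path)
    # Passing a list (and no shell) avoids quoting problems with spaces in paths.
    return subprocess.run(
        command,
        check=True,
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
    )
```

In a .NET program the same role is played by `System.Diagnostics.Process`: start the executable with the PDF path as an argument, wait for it to exit, and check the exit code before reading the generated HTML.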

Tudor
6

If you don't mind paying, Aspose offers a very good solution; this is what we use at my company.

http://www.aspose.com/categories/.net-components/aspose.pdf-for-.net/key-features.aspx

Calum
  • We, too. In addition, recently the product [Spire](http://www.e-iceblue.com/) showed up, providing similar tools to Aspose. – Uwe Keim Nov 14 '11 at 15:33
  • 2
    Aspose doesn't work as easily as advertised and the resulting HTML is really bad. Plus, if you need in-memory conversion rather than conversion to a file, you need to convert to DOC first and then DOC to HTML. – LemonCool Aug 17 '18 at 15:53