25

PDF.js is the latest library from Mozilla: a standards-based PDF renderer written entirely in JavaScript. Currently you cannot access the generated HTML, and the library can only be used as a viewer. Is it possible to use PDF.js to statically convert a PDF to its HTML equivalent? Since it renders in a browser, the output must be HTML+CSS, with the JS used only for navigation.

After converting it to HTML I plan to use our existing HTML workflow to import/index/consume the page as if it were an ordinary HTML webpage.

Robin Rodricks

4 Answers

19

Note: this is for the original question, as well as for others who may be visiting this for related help, as was the case with me. ;)

Answer:
You may try Poppler, or pdf2htmlEX, which is based on Poppler.

I'd recommend looking at the pdf2htmlEX documentation; it also has a very good comparison table.
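For anyone wiring one of these tools into an existing HTML workflow, here is a minimal sketch of calling them from a script. It assumes the `pdf2htmlEX` binary (or Poppler's `pdftohtml`) is installed and on `PATH`; the helper names `build_command` and `convert` and the example file paths are hypothetical, not part of either tool.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir, tool="pdf2htmlEX"):
    """Return the argument list for converting one PDF to HTML."""
    pdf = Path(pdf_path)
    out = Path(out_dir)
    if tool == "pdf2htmlEX":
        # Usage: pdf2htmlEX [options] <input.pdf> [<output.html>]
        # --dest-dir sets the directory the HTML file is written to.
        return ["pdf2htmlEX", "--dest-dir", str(out), str(pdf), pdf.stem + ".html"]
    if tool == "pdftohtml":
        # Poppler: -s writes a single document, -noframes skips the frameset.
        return ["pdftohtml", "-s", "-noframes", str(pdf), str(out / pdf.stem)]
    raise ValueError(f"unknown tool: {tool}")


def convert(pdf_path, out_dir, tool="pdf2htmlEX"):
    """Run the converter; raises if the tool is missing or exits non-zero."""
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool} is not installed or not on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_command(pdf_path, out_dir, tool), check=True)
```

Something like `convert("docs/report.pdf", "out")` should then leave an HTML file in `out/` that can go through the usual import/index step. The exact output file names and layout fidelity depend on the tool and its version, so check the result on a few representative PDFs first.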

Asad Malik
  • I ran into a number of conversion issues with the above (see http://goo.gl/kYyhcQ) and have switched to DocPub http://goo.gl/DKbmq1 –  Mar 14 '14 at 21:37
8

pdf.js renders to Canvas, so it can't be used to statically convert a PDF to HTML.

Ika
  • While it is true that `pdf.js` uses canvas to create some of the output and does not render directly to HTML (which might really be the better choice), it is also true that **more** is rendered than what ends up in the canvas element: text elements are rendered as well, which let the user highlight and search text. Although not simple, it is perfectly conceivable to have a script render each page to canvas, export it as an image, and bundle it with the HTML text elements to get a more HTML-like output – humanityANDpeace Oct 16 '18 at 08:24
  • This would lose all vector properties of the PDF, including the ability to search or select text. So it is possible, but unlikely to be a working solution – Ika Feb 25 '19 at 05:57
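
The hybrid idea from the first comment (a page image with real text elements on top) can be sketched roughly as below. This is only an illustration under stated assumptions: the page image is assumed to have been exported already, and the `items` list (dicts with `text`, `x`, `y`, `size` in CSS pixels, origin at the top-left) is a hypothetical shape, not the data structure pdf.js actually produces.

```python
from html import escape


def page_to_html(image_src, width, height, items):
    """Overlay transparent text spans on a rendered page image.

    `items` is a hypothetical list of dicts with keys text, x, y, size.
    """
    lines = [
        f'<div class="page" style="position:relative;width:{width}px;height:{height}px">',
        f'<img src="{escape(image_src)}" width="{width}" height="{height}" alt="">',
    ]
    for item in items:
        # Each span sits at the position of the drawn text; it is transparent
        # so the image stays visible while the text remains in the DOM.
        style = (
            f"position:absolute;left:{item['x']}px;top:{item['y']}px;"
            f"font-size:{item['size']}px;color:transparent;white-space:pre"
        )
        lines.append(f'<span style="{style}">{escape(item["text"])}</span>')
    lines.append("</div>")
    return "\n".join(lines)
```

Because the text ends up as ordinary elements in the HTML, it should stay selectable and indexable, although the vector graphics themselves are flattened into the image, as the second comment points out.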
2

DocPub is powered by PDFNet, a PDF SDK with C# support, which supports converting PDF to HTML offline.

WebViewer from the same company is an HTML5-based PDF viewer that renders documents on-the-fly within the browser.

WebViewer works with all major Web platforms; the viewer can be directly embedded and customized within any HTML5, Silverlight, or Flash application. The content can be instantly accessed from any system or device - including iPad/iPhone (iOS), Android, Windows (desktop & tablets), WP8, Linux, Mac, etc. -- demo

Robin Rodricks
  • DocPub does not work well for some signs; https://github.com/coolwanglu/pdf2htmlEX works perfectly in my case – BartusZak Nov 18 '20 at 09:42
-1

AccuSoft has an HTML5-based PDF/DOC viewer called Prizm. I don't think this can convert the PDF statically to HTML, but it looks like a functional HTML5-based viewer. I have no experience with it, but the online HTML5 demo (the link) looks pretty impressive. They claim it can be used on PC & Mobile for great rendering of such files.

Accusoft HTML5 viewing technology can display virtually any document file—DOC, PDF, PPT, CAD and dozens more—through the native browser on almost any smartphone or tablet, with no additional apps or players required on users’ devices.

Robin Rodricks