How to convert PDF to HTML?

Question

Is there a proper library which I can use to convert PDF to HTML or some other format that can be converted to HTML easily?

I searched similar questions, but had no luck.

I want to be able to extract text from PDFs, and possibly images. I'm not looking to embed the PDF inside the HTML.

I know this was a long time ago, but if you don't mind, what did you end up using? – boggy, Dec 11 '19 at 18:53
To the people still visiting, try [pdf2htmlEX](https://github.com/coolwanglu/pdf2htmlEX) — Bamwani, Oct 26 '22 at 20:30

score 28 · Answer 1 · edited Mar 09 '21 at 13:09

If you're on Linux, try pdftohtml:

sudo apt-get install poppler-utils
pdftohtml -enc UTF-8 -noframes infile.pdf outfile.html

On macOS (with Homebrew), pdftohtml can be installed with:

brew install pdftohtml

The open-source ebook converter Calibre can also convert PDF files to HTML and is available on macOS, Windows, and Linux.
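
For scripted conversions, here is a minimal Python sketch that only wraps the `pdftohtml` command shown above. The function names `build_pdftohtml_command` and `convert` are hypothetical. The `-c` (complex layout) and `-dataurls` (inline images) switches are optional and not part of this answer; `-dataurls` in particular assumes a reasonably recent poppler, so check `pdftohtml -h` on your system first.

```python
import subprocess
from pathlib import Path


def build_pdftohtml_command(pdf_path, html_path, complex_layout=False, inline_images=False):
    """Return the argv list for poppler's pdftohtml (helper name is illustrative)."""
    cmd = ["pdftohtml", "-enc", "UTF-8", "-noframes"]
    if complex_layout:
        cmd.append("-c")  # try to keep the page layout
    if inline_images:
        cmd.append("-dataurls")  # embed images as data URLs (assumes a recent poppler)
    cmd += [str(pdf_path), str(html_path)]
    return cmd


def convert(pdf_path, html_path, **options):
    """Run pdftohtml; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(build_pdftohtml_command(pdf_path, html_path, **options), check=True)
    return Path(html_path)


if __name__ == "__main__":
    # Only prints the command; call convert(...) to actually run it.
    print(build_pdftohtml_command("infile.pdf", "outfile.html"))
```

Building the argument list separately from running it keeps the command easy to inspect and avoids shell quoting problems with file names that contain spaces.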

edited Mar 09 '21 at 13:09 by ccpizza
answered Nov 27 '16 at 22:37 by moof2k

Please note all layout will be gone. – Bjorn Reppen Nov 08 '20 at 18:02
Is there any way to inline the images so I don't need to host JPGs? – chovy Dec 16 '20 at 13:26
@chovy Supply the -dataurls option to generate inline images. Supply -c to generate complex HTML, with each page of the PDF on a separate HTML page whose layout is more or less the same as the original. I noticed that the images, boxes, and other decorations on each page are rendered as one image used as a background, while the text is extracted and shown in front of it. This keeps the layout more or less the same, with some minor overlapping, and the result is quite interesting. Example use: pdftohtml -dataurls -c pdf_file_with_bookmarks.pdf sample_output.html – Gulshan Chaurasia Mar 16 '21 at 13:42
How do I install it on Arch? – chovy Aug 15 '21 at 06:21
Pdftohtml effectively only extracts text from the PDF. All formatting/font/colors are removed, which makes this tool fairly useless as a "converter". HTML is a lot more than just text. Calibre unfortunately does the same thing. – Cerin Mar 23 '23 at 18:40

score 6 · Accepted Answer · answered Jun 07 '12 at 06:27

As I mentioned in the comment above, it is definitely possible to convert PDF to HTML using the tool Able2Extract 7, which can be downloaded from here

I have been using this tool for almost 2 years now and I am pretty happy with it. It lets you convert PDF to Word, Excel, PowerPoint, Publisher, HTML, OpenOffice, etc.

Important note: This tool is not freeware.

HTH

answered Jun 07 '12 at 06:27 by Siddharth Rout

This tool is good at accurately converting pdf to .html or .docx. I use it with Calibre to pre-process a .pdf file into .html or .docx, so it will render correctly on my eReader (Kindle or Sony). – Contango Feb 20 '13 at 12:57
Actually, at http://www.pdf.investintech.com/ they allow you to convert a PDF to HTML online. I tried it with a research paper, and the conversion was pretty accurate, except for mathematical formulae. One drawback, though, is that it's not very smart: for example, each line is wrapped in a new, absolutely positioned div. – aercolino Aug 09 '14 at 13:03
Why is every response to this question on Stack Overflow almost like an advert for a paid-for solution? – Jay Croghan Nov 02 '20 at 04:31
@JayCroghan Way back in 2012, there was actually no reliable freeware. – Siddharth Rout Nov 02 '20 at 04:53
@SiddharthRout It seems that even now there isn't really any great freeware for this. – Jay Croghan Nov 03 '20 at 05:02

score 3 · Answer 3 · edited Nov 23 '16 at 21:55

Download

pdfbox-2.0.3.jar
fontbox-2.0.3.jar
preflight-2.0.3.jar
xmpbox-2.0.3.jar
pdfbox-tools-2.0.3.jar
pdfbox-debugger-2.0.3.jar

from http://pdfbox.apache.org/

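    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.tools.PDFText2HTML;

    public class PdfToHtml {
        public static void main(String[] args) {
            // "infile.pdf" is a placeholder; read the PDF from wherever you need.
            // try-with-resources closes the stream and the document, even on failure.
            try (InputStream is = new FileInputStream("infile.pdf");
                 PDDocument pdd = PDDocument.load(is)) { // in-memory representation of the PDF document
                PDFText2HTML converter = new PDFText2HTML(); // the converter
                String html = converter.getText(pdd); // That's it!
                System.out.println(html);
            } catch (IOException ioe) {
                System.err.println("Could not convert PDF: " + ioe.getMessage());
            }
        }
    }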

Please note: Images do not get pushed to the HTML output.

This library seems to work better - but it produces invalid, unparsable HTML. This is quite disappointing for such an Apache project. — Regis May, Jul 07 '19 at 08:21

score 3 · Answer 4 · edited Jun 09 '17 at 00:15

It's not that difficult to convert PDF to HTML. There are many online options, which may, however, expose your data to third parties. Follow these steps, and the output is great.

Open the pdf2htmlEX page. (You can either follow the steps I describe next or follow the directions on that page.)
The package is available for download for Windows from here.

From the many options available, I recommend downloading "pdf2htmlEX-win32-0.14.6-upx-with-poppler-data.zip (pdf2htmlEx.exe is packed with UPX)".
After downloading and unzipping, conversion is just one cmd command away.
```
C:\Users\kjk\Downloads\pdf2htmlEX-win32-0.14.6-upx-with-poppler-data>pdf2htmlEX.exe c:\1\abc.pdf
```
Final Command:
```
pdf2htmlEX.exe c:\1\abc.pdf
```
(You can of course shorten the name of the folder; I kept it the same as you would see after unzipping the download. I am assuming you can change the directory in cmd to the desired folder; if not, search online for how.)

abc.pdf will be converted to HTML and saved as abc.html in the same folder as your exe.
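
If you want to drive this from a script, the following Python sketch builds the same command. It assumes that pdf2htmlEX names the output after the input file and writes it to the current working directory unless `--dest-dir` is given (in the example above, the current directory is the folder containing the exe). The helper names `build_pdf2htmlex_command` and `expected_output` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_pdf2htmlex_command(pdf_path, dest_dir=None, exe="pdf2htmlEX"):
    """Return the argv list for pdf2htmlEX (helper name is illustrative)."""
    cmd = [exe]
    if dest_dir is not None:
        cmd += ["--dest-dir", str(dest_dir)]  # where the .html file is written
    cmd.append(str(pdf_path))
    return cmd


def expected_output(pdf_path, dest_dir="."):
    """Assumed output location: input base name with an .html extension."""
    return Path(dest_dir) / (Path(pdf_path).stem + ".html")


if __name__ == "__main__":
    cmd = build_pdf2htmlex_command("abc.pdf", dest_dir="out")
    print(cmd, "->", expected_output("abc.pdf", "out"))
    # To actually convert: subprocess.run(cmd, check=True)
```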

On Mac you can use `brew install pdf2htmlEX` – ccpizza Mar 09 '21 at 13:28
Or with MacPorts it's `sudo port install pdf2htmlex` – Rich Mar 20 '23 at 03:38

score 1 · Answer 5 · edited Jul 06 '20 at 19:32

Yes, it definitely is possible. If you're on Ubuntu Linux:

apt-get install pdftohtml

then

pdftohtml myFile.pdf myFile.htm -c -noframes

If you want to see what all the flags mean then just type

pdftohtml

If you're not on Linux, there are a plethora of tools out there that you can use to make this happen.
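
When a whole folder of PDFs needs converting, the same `-c -noframes` invocation can be repeated per file. The Python sketch below only plans the commands and checks whether the tool is on the PATH. The names `plan_batch` and `tool_available` are hypothetical, and the `.htm` output naming simply follows the example above.

```python
import shutil
import subprocess
from pathlib import Path


def tool_available(tool="pdftohtml"):
    """True if the converter can be found on the PATH."""
    return shutil.which(tool) is not None


def plan_batch(pdf_dir, tool="pdftohtml"):
    """Return one argv list per *.pdf file in pdf_dir; nothing is executed here."""
    commands = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        out = pdf.with_suffix(".htm")  # myFile.pdf -> myFile.htm
        commands.append([tool, "-c", "-noframes", str(pdf), str(out)])
    return commands


if __name__ == "__main__":
    if tool_available():
        for cmd in plan_batch("."):
            subprocess.run(cmd, check=True)  # stop at the first failing file
    else:
        print("pdftohtml not found on PATH")
```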

edited Jul 06 '20 at 19:32 by jfearn
answered Mar 08 '12 at 18:40 by Samir Patel

Wrong direction, the question is about pdf to html – yms Jun 07 '12 at 14:29
A bit late, but looking at the parameters, it seems that OP meant `pdftohtml` – eithed Dec 08 '15 at 15:45
It looks like `pdftohtml` is also available on Windows via TeX Live: https://www.tug.org/texlive/ – Paul Wintz Feb 03 '22 at 04:19

score -1 · Answer 6 · answered Mar 03 '21 at 07:00

Here is one possibility with Linux pdfgrep and sed

sudo apt install pdfgrep

pdfgrep . yourdoc.pdf | sed '/^$/d' | sed -e 's/^%%/<p>%%/' | sed -e 's/^--/<p>--/' | sed -e 's/--$/--<p>/' > yourdoc.html

To format the output properly, you need to adapt the sed regular expressions to the structure of your document.
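
The same idea can be sketched in Python once the text has been extracted, which makes the rules easier to adjust than a chain of sed calls. This is only an illustration: `text_to_html` is a hypothetical helper, and the default pattern assumes, as in the sed command above, that a line starting with `%%` or `--` begins a new paragraph.

```python
import html
import re


def text_to_html(text, paragraph_start=r"^(%%|--)"):
    """Wrap extracted plain text in <p> elements.

    A line matching paragraph_start opens a new paragraph; the default
    pattern is an assumption about the document's structure.
    """
    start = re.compile(paragraph_start)
    paragraphs, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # drop blank lines, roughly like sed '/^$/d'
        if start.search(line) and current:
            paragraphs.append(" ".join(current))
            current = []
        current.append(html.escape(line))  # keep <, > and & valid in HTML
    if current:
        paragraphs.append(" ".join(current))
    return "\n".join(f"<p>{p}</p>" for p in paragraphs)


if __name__ == "__main__":
    print(text_to_html("%% Title\n\nfirst line\n-- next section"))
```

Unlike the sed version, this closes every paragraph it opens and escapes special characters, so the result is well-formed even when the PDF text contains `<` or `&`.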
