Optimal way to convert PDF file to HTML

Question

I am trying to convert thousands of PDF files to HTML. I was able to convert this PDF file to this HTML file using the following code:

import os

def convertPDFToHtml():
    # Run pdfminer's pdf2txt.py script to write test.pdf out as output.html
    command = 'pdf2txt.py -o output.html -t html test.pdf'
    os.system(command)

I want to parse the HTML file so that I can extract various pieces of text from it. The problem is that the output HTML file is missing a lot of the text from the original PDF.

Is there a better way to convert the PDF file and parse the HTML text?

Can you please rehost your files? I got some 404 notices and your files are hosted in rather obscure sites. — Enthus3d, Sep 15 '19 at 22:10
I don't think such a thing exists. All PDF to HTML converters that do a decent job will create images for each page, and then only use HTML to present those images in an organized way. And as you'd imagine, you cannot trivially parse text from an image. — Havenard, Sep 15 '19 at 23:25
There are some fundamental differences between PDF and HTML that make creating an HTML file that truly represents the original PDF almost impossible. Maybe you can do it using Canvas or SVG, but not with real HTML. — Havenard, Sep 15 '19 at 23:29
Are you just trying to extract the text from the PDF for further processing? Or are HTML files themselves important for you? Do you use the HTML output for anything else? — Ryan, Sep 16 '19 at 21:50

score 0 · Answer 1 · answered Sep 15 '19 at 23:11

This may be a similar problem to the one discussed here, unless you specifically need to generate HTML files. Even so, you could first extract the text from the PDFs as plain, unformatted text, parse it, and then generate the HTML files.
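
A minimal sketch of that extract, parse, then generate approach is shown here. It assumes the pdfminer.six package is installed (the same project that ships pdf2txt.py) and uses its extract_text function. The helper names (extract_pdf_text, split_paragraphs, paragraphs_to_html, convert_pdf_to_html) and the blank-line paragraph rule are illustrative assumptions, not something from the question, so treat it as a starting point rather than a finished converter.

```python
import html
from pathlib import Path


def extract_pdf_text(pdf_path):
    """Return the plain text of a PDF. Assumes pdfminer.six is installed."""
    # Imported lazily so the parsing helpers below work without pdfminer.six.
    from pdfminer.high_level import extract_text
    return extract_text(str(pdf_path))


def split_paragraphs(text):
    """Split raw text into paragraphs on blank lines (an assumed heuristic)."""
    blocks = text.replace("\r\n", "\n").split("\n\n")
    # Collapse runs of whitespace inside each block and drop empty blocks.
    paragraphs = [" ".join(block.split()) for block in blocks]
    return [p for p in paragraphs if p]


def paragraphs_to_html(paragraphs, title="Document"):
    """Wrap each paragraph in a <p> tag, escaping HTML special characters."""
    body = "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        f"<body>\n{body}\n</body></html>\n"
    )


def convert_pdf_to_html(pdf_path, html_path):
    """Extract text from one PDF and write it out as a simple HTML page."""
    text = extract_pdf_text(pdf_path)
    page = paragraphs_to_html(split_paragraphs(text), title=Path(pdf_path).stem)
    Path(html_path).write_text(page, encoding="utf-8")
```

For thousands of files, convert_pdf_to_html could be called in a loop over Path("some_dir").glob("*.pdf"), where some_dir is a placeholder for your own folder. Since the parsing happens on the plain text, the HTML step only matters if you actually need HTML output.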

s0mbre
