24

How can I convert PDF files to HTML with Python?

I was thinking of something along the lines of what Google does (or seems to do) to index PDF files.

My final goal is to set up Apache to show the HTML for the PDF files, so anything leading me in that direction would also be appreciated.

Marcos Lara

1 Answer

6

The poppler package provides a pdftohtml utility that you might be able to use. There is also a Python binding to libpoppler.
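As a rough sketch of how the command-line route could be driven from Python: the snippet below shells out to poppler's `pdftohtml` and captures the HTML from standard output. It assumes `pdftohtml` is installed and on the PATH; the function names `build_pdftohtml_command` and `pdf_to_html` are made up for illustration, and the chosen flags (`-noframes`, `-i`, `-q`, `-stdout`) are just one plausible combination, so check `pdftohtml -h` for your version.

```python
import shutil
import subprocess


def build_pdftohtml_command(pdf_path):
    """Return the argument list for a single-document, text-only conversion.

    Hypothetical helper: the flag selection is an assumption, not a
    recommendation from the poppler documentation.
    """
    return [
        "pdftohtml",
        "-noframes",  # emit one HTML document instead of a frameset
        "-i",         # ignore images, keep only the text
        "-q",         # suppress informational messages
        "-stdout",    # write the HTML to standard output
        pdf_path,
    ]


def pdf_to_html(pdf_path):
    """Convert one PDF and return the generated HTML as bytes."""
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml was not found on PATH")
    result = subprocess.run(
        build_pdftohtml_command(pdf_path),
        check=True,           # raise CalledProcessError on a non-zero exit
        capture_output=True,  # collect stdout and stderr
    )
    return result.stdout
```

Calling `pdf_to_html("report.pdf")` (with a file name of your own) should return the HTML as bytes, which could then be written to disk for Apache to serve or produced on demand by a small CGI or WSGI handler.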

Martin v. Löwis
  • The python binding is mostly for rendering PDF in a GTK widget/ui, so I am not sure it would help here. – Ali Afshar Nov 09 '08 at 21:40
  • I haven't actually used it, but it does expose poppler_page_get_text, which might be useful to the OP. – Martin v. Löwis Nov 09 '08 at 21:49
  • Right, but it seems a big waste of the GTK/GLib bindings if that's all the OP wants, especially as there are other, easier ways that don't depend on a UI toolkit (e.g. the pdftohtml utility you mention). I should say I generally like the bindings, and was the original author. Maybe not in this case though. – Ali Afshar Nov 10 '08 at 11:34