9

I run a job search site, and I need to convert DOC, DOCX, and PDF files into HTML on a Linux CentOS server running PHP. People submit these files as resumes. So far, I have found PHPDocx to be great at converting DOCX to HTML, but I am stuck on DOC and PDF. pdftohtml gives the error "bad color" when I run tests. As for DOC, I have only found wvWare, which seems complex and bulky to install.

Does anyone have any ideas on how to easily convert DOC/PDF to HTML?

Nauphal
sam

4 Answers

5

The only thing I can think of is FPDF. It is intended for creating PDF files in PHP, but it can also open PDF files. Maybe you can use it as a base and develop some sort of toHTML function for it.

It is completely free to use and it has some extensions already. It MIGHT help you.

http://www.fpdf.org

EDIT: Thanks to Pierre for this addition from the comments:

You can use FPDI: http://www.setasign.de/products/pdf-php-solutions/fpdi but the input PDF is treated just like an image.

I haven't taken a look at it myself so far, but this might help.

Ch33f
  • I know that! But it can also read existing PDF files, and I am pretty sure you could develop something that would output HTML using FPDF as the base class! – Ch33f Aug 20 '13 at 12:39
  • 2
    +1 because of user1914292's unfair downvote; he downvoted without reading the answer. But Ch33f, you can't use FPDF as expected. You can use FPDI: http://www.setasign.de/products/pdf-php-solutions/fpdi/ but the input PDF is treated just like an image. – Pierre Aug 26 '13 at 17:55
  • Thanks for the +1 and for the addition to my post, I'll include that in an edit. :) – Ch33f Aug 27 '13 at 13:26
3

As far as .doc files go, how about trying OpenOffice/LibreOffice? Something like:
lowriter --convert-to html doc_file.doc
As far as PDF goes, if the PDF is only a graphical representation of text (a scan, for example), you're out of luck; the best you can do is try to convert it to an image with ImageMagick. If it contains proper text, it should convert easily.
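
A minimal Python sketch of both ideas, in case it helps. It assumes LibreOffice is on the PATH as `soffice` and that `pdftotext` (from poppler-utils) is installed; the function names and the 20-character threshold are made up for illustration. From PHP you would run the same commands through `exec()` with `escapeshellarg()`.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir):
    """Build the LibreOffice command line for a headless DOC -> HTML conversion."""
    return [
        "soffice",            # assumption: on PATH; some installs call it "libreoffice"
        "--headless",         # no GUI, which is what you want on a server
        "--convert-to", "html",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_doc_to_html(doc_path, out_dir, timeout=60):
    """Run the conversion and return the path LibreOffice is expected to write."""
    subprocess.run(build_convert_command(doc_path, out_dir), check=True, timeout=timeout)
    return Path(out_dir) / (Path(doc_path).stem + ".html")


def looks_like_text(extracted, min_chars=20):
    """True if the extracted text has at least min_chars non-whitespace characters."""
    return len("".join(extracted.split())) >= min_chars


def pdf_has_text_layer(pdf_path, min_chars=20):
    """Heuristic: a PDF with almost no extractable text is probably a scan."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],  # "-" sends the text to stdout
        check=True, capture_output=True, text=True,
    )
    return looks_like_text(result.stdout, min_chars)
```

If `pdf_has_text_layer()` returns False, the file is most likely an image-only PDF and a text-based converter will not get anything useful out of it.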

tmo
2

There are already various tools for this, such as unoconv (http://dag.wieers.com/home-made/unoconv/) and PHPDocx (http://www.phpdocx.com/), which you've already tried.

http://www.phplivedocx.org/2009/08/13/convert-docx-doc-rtf-to-html-in-php/ looks promising.

Or, you could install a portable version of libreoffice on your server which allows command line conversion https://help.libreoffice.org/Common/Starting_the_Software_With_Parameters

I'm sure there are tutorials out there (in the LibreOffice support area).
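
As a rough sketch of how you might choose between these tools at runtime, the Python below prefers unoconv and falls back to LibreOffice. The helper names are hypothetical, and it assumes the binaries are called `unoconv` and `soffice` on your server; `-f` is unoconv's output-format option.

```python
import shutil


def pick_office_converter(which=shutil.which):
    """Return the argv prefix of the first converter found on PATH, or None."""
    if which("unoconv"):
        return ["unoconv", "-f", "html"]
    if which("soffice"):
        return ["soffice", "--headless", "--convert-to", "html"]
    return None


def command_for(doc_path, which=shutil.which):
    """Build the full command for one document, failing loudly if no tool exists."""
    prefix = pick_office_converter(which)
    if prefix is None:
        raise RuntimeError("neither unoconv nor soffice was found on PATH")
    return prefix + [str(doc_path)]
```

The `which` parameter is only there so the lookup can be swapped out in tests; in normal use you would call `command_for("resume.doc")` and pass the result to `subprocess.run`.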

James
1

To easily convert PDF to HTML, I would suggest pdf2htmlEX, which produces outstanding HTML and is fast enough for converting at runtime. You should first put some effort into building and optimizing it for your system. There is a simple build how-to on the project page.
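
Once it is built, calling it could look roughly like this Python sketch. It assumes the binary is on the PATH as `pdf2htmlEX`; the function names and paths are only illustrative. `--dest-dir` sets the output directory, and the last argument is the output file name.

```python
import subprocess
from pathlib import Path


def build_pdf2htmlex_command(pdf_path, out_dir):
    """Build a pdf2htmlEX command that writes <name>.html into out_dir."""
    pdf_path = Path(pdf_path)
    return [
        "pdf2htmlEX",                # assumption: built and installed on PATH
        "--dest-dir", str(out_dir),  # directory for the generated HTML
        str(pdf_path),               # input PDF
        pdf_path.stem + ".html",     # output file name, relative to out_dir
    ]


def convert_pdf_to_html(pdf_path, out_dir, timeout=120):
    """Run pdf2htmlEX and return the path of the HTML file it should produce."""
    subprocess.run(build_pdf2htmlex_command(pdf_path, out_dir), check=True, timeout=timeout)
    return Path(out_dir) / (Path(pdf_path).stem + ".html")
```

The timeout is a guess; large or image-heavy resumes may need more.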

Breign