
I have a bunch of .doc files (not .docx) that I want to convert to HTML files.

I tried Python's docx2html module, but it only supports .docx files, not .doc.

How can I do this?

Tomerikoo
ofnowhere
  • It should be fairly simple to just convert all of your .doc files to .docx files with COMs (if you're on Windows) – wnnmaw Jun 19 '14 at 14:19
  • @wnnmaw Can you please elaborate on exactly how that can be achieved? – ofnowhere Jun 19 '14 at 14:23
  • [This question](http://stackoverflow.com/questions/6011115/doc-to-pdf-using-python) covers how to convert a .doc to a .pdf, which you should be able to adapt to convert to .docx by replacing `wdFormatPDF` with the appropriate constant from [here](http://msdn.microsoft.com/en-us/library/office/bb238158(v=office.12).aspx) – wnnmaw Jun 19 '14 at 14:37
  • wdFormatDocument97 and 0 value are the ones you're looking for – heinst Jun 19 '14 at 14:58
  • First convert them to docx. Then use this library: https://github.com/mwilliamson/python-mammoth – guettli Mar 29 '20 at 18:02
  • Related: [How do you convert a Word Document into very simple html in Python?](https://stackoverflow.com/q/1596911/6045800) – Tomerikoo Aug 11 '21 at 14:18
  • Simply convert your doc files to docx. You might wanna have a look at this. https://stackoverflow.com/questions/1596911/how-do-you-convert-a-word-document-into-very-simple-html-in-python – Naman Jun 19 '14 at 15:16

1 Answer


I solved it by calling LibreOffice's soffice from my Python module using subprocess.call. soffice can convert .doc to HTML directly.

Note that with this solution, the output HTML file may lose some formatting. In my case, it preserved the font face, font size, and run-level formatting (bold, italic, etc.), which were the essentials for me.

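```python
import subprocess

# Assuming `filename` has already been assigned the path of the input .doc file
# and that `soffice` is on PATH. --headless runs LibreOffice without a GUI.
subprocess.call(['soffice', '--headless', '--convert-to', 'html', filename])
```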

This generates an HTML document with the same base name as the input. By default soffice writes it to the current working directory; pass --outdir to choose a different location.

You can then restyle the .html file with CSS if necessary.
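Since the question is about a whole batch of .doc files, the same soffice call can be wrapped in a loop. The following is only a minimal sketch, not a tested production script: `build_command`, `convert_all`, and the `runner` parameter are hypothetical names introduced here for illustration, and it assumes `soffice` is on PATH and that the .doc files sit directly inside one directory.

```python
import subprocess
from pathlib import Path


def build_command(doc_path, out_dir, soffice="soffice"):
    """Return the soffice argument list that converts one .doc file to HTML."""
    return [
        soffice,
        "--headless",
        "--convert-to", "html",
        "--outdir", str(out_dir),  # write the result here instead of the cwd
        str(doc_path),
    ]


def convert_all(src_dir, out_dir, runner=subprocess.run):
    """Convert every .doc file directly inside src_dir.

    Returns the paths where the HTML files are expected to appear.
    `runner` is a hypothetical hook so the loop can be exercised without
    LibreOffice installed; by default it is subprocess.run.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for doc_path in sorted(src_dir.glob("*.doc")):
        # check=True raises CalledProcessError if soffice exits non-zero.
        runner(build_command(doc_path, out_dir), check=True)
        outputs.append(out_dir / (doc_path.stem + ".html"))
    return outputs
```

Calling `convert_all("docs", "html_out")` (both directory names are placeholders) would run soffice once per file and return the list of expected output paths. Running one process per file is slower than passing many files to a single soffice invocation, but it makes it easier to see which document failed.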

Tomerikoo
Ruslan