I have a bunch of .doc files (not .docx) and I want to convert them into HTML files.
I tried Python's docx2html module too, but it only supports .docx files, not .doc.
So how can I achieve this?
I solved it by calling LibreOffice's soffice from my Python module using subprocess.call. With soffice, you can convert doc to html directly.
Be aware that with this solution, outputfile.html may lose some of the formatting styles.
In my case, it preserved the font face, font size and runs (bold, italic etc.), which were the essentials for me.
import subprocess
# Assuming `filename` has already been assigned for input file name
subprocess.call(['soffice', '--headless', '--convert-to', 'html', filename])
This will generate an HTML document with the same base name as the input. By default soffice writes it to the current working directory; pass --outdir to put it somewhere else.
You can then go ahead and restyle the .html file with some CSS if necessary.
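Since the question is about a whole bunch of files, here is a minimal sketch of how the same call could be looped over a directory. The helper names build_command and convert_all are my own, not part of LibreOffice or any library, and the sketch assumes soffice is on your PATH. A file counts as failed if soffice exits with a non-zero code or if the expected .html file does not show up afterwards; that second check is only a heuristic, since a stale file from an earlier run would also pass it.

```python
import subprocess
from pathlib import Path


def build_command(doc_path, out_dir):
    """Build the soffice argument list for converting one file to HTML."""
    return [
        "soffice",
        "--headless",
        "--convert-to", "html",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_all(src_dir, out_dir):
    """Convert every .doc file directly inside src_dir.

    Returns the list of input paths whose conversion appears to have failed.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    # "*.doc" does not match .docx files, so those are skipped.
    for doc_path in sorted(Path(src_dir).glob("*.doc")):
        result = subprocess.run(build_command(doc_path, out_dir))
        expected = out_dir / (doc_path.stem + ".html")
        if result.returncode != 0 or not expected.exists():
            failed.append(doc_path)
    return failed
```

You would call it as failed = convert_all("docs", "html_out"), where both directory names are placeholders, and then inspect failed for anything that needs a second look.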