Cannot convert DOCX to HTML with Python

Question

I've tried it by using mammoth:

import mammoth

result = mammoth.convert_to_html("MyDocument.docx")
print (result.value)

I don't get HTML, but this strange output:

kbW7yqZoo4h9pYM6yBxX1QFx2pCoPYflXfieIPbtqpT913Vk7OzcZdEk3eO7TbWjvZNTGilsfmRrPwDvB[...]

I've also tried to use docx2html, but I can't install it. When I run pip install docx2html I get this error:

SyntaxError: Missing parentheses in call to 'print'

score 7 · Accepted Answer · answered Dec 20 '17 at 12:31

Mammoth .docx to HTML converter

Mammoth is designed to convert .docx documents, such as those created by Microsoft Word, and convert them to HTML. Mammoth aims to produce simple and clean HTML by using semantic information in the document, and ignoring other details. For instance, Mammoth converts any paragraph with the style Heading 1 to h1 elements, rather than attempting to exactly copy the styling (font, text size, colour, etc.) of the heading.

There's a large mismatch between the structure used by .docx and the structure of HTML, meaning that the conversion is unlikely to be perfect for more complicated documents. Mammoth works best if you only use styles to semantically mark up your document.

The following features are currently supported:

Headings.
Lists.
Customisable mapping from your own docx styles to HTML. For instance, you could convert WarningHeading to h1.warning by providing an appropriate style mapping.
Tables. The formatting of the table itself, such as borders, is currently ignored, but the formatting of the text is treated the same as in the rest of the document.
Footnotes and endnotes.
Images.
Bold, italics, underlines, strikethrough, superscript and subscript.
Links.
Line breaks.
Text boxes. The contents of the text box are treated as a separate paragraph that appears after the paragraph containing the text box.
Comments.
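As a rough sketch of the custom style mapping mentioned in the list above: Mammoth's convert_to_html accepts a style_map argument. The style name WarningHeading comes from the example in the list; the file name document.docx and the helper convert_with_style_map are only placeholders for illustration.

```python
# Map paragraphs styled "WarningHeading" to <h1 class="warning">.
# ":fresh" asks Mammoth to start a new element for each matching paragraph.
STYLE_MAP = "p[style-name='WarningHeading'] => h1.warning:fresh"


def convert_with_style_map(docx_path, style_map=STYLE_MAP):
    """Convert a .docx file to an HTML fragment using a custom style map.

    Hypothetical helper; returns (html, messages).
    """
    # Imported lazily so this module loads even where mammoth is not installed.
    import mammoth

    with open(docx_path, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file, style_map=style_map)
    return result.value, result.messages


if __name__ == "__main__":
    # "document.docx" is a placeholder path.
    html, messages = convert_with_style_map("document.docx")
    print(html)
```

Any style that has no mapping should show up as a warning in the returned messages, which is a quick way to find out which style names your document actually uses.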

Installation

pip install mammoth

Basic conversion

To convert an existing .docx file to HTML, pass a file-like object to mammoth.convert_to_html. The file should be opened in binary mode. For instance:

import mammoth

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value # The generated HTML
    messages = result.messages # Any messages, such as warnings during conversion
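If the output looks wrong, it may help to print result.messages before anything else. The sketch below assumes each message has type and message attributes, as described in the Mammoth README; format_messages and FakeMessage are hypothetical names, and the warning text is made up for the demo.

```python
from collections import namedtuple


def format_messages(messages):
    """Return one 'type: message' line per conversion message."""
    return ["{0}: {1}".format(m.type, m.message) for m in messages]


# Stand-in for a Mammoth message so the sketch runs without a .docx file.
FakeMessage = namedtuple("FakeMessage", ["type", "message"])
demo = [FakeMessage("warning", "Unrecognised paragraph style: 'Quote'")]

for line in format_messages(demo):
    print(line)
```

With a real conversion you would call format_messages(result.messages) instead of passing the demo list.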

You can also extract the raw text of the document by using mammoth.extract_raw_text. This will ignore all formatting in the document. Each paragraph is followed by two newlines.

with open("document.docx", "rb") as docx_file:
    result = mammoth.extract_raw_text(docx_file)
    text = result.value # The raw text
    messages = result.messages # Any messages

Thanks, I've tested the first code and I get the following response when I print the value of `html`: `bxPdWskbW7yqZoo4h9pYM6yBxX1QFx2pCoPYflXfieIPbtqpT913Vk7OzcZdEk3eO7TbWjvZNTGilsfmRrPwDvfiP8AFjw54d8SeF/HXgW58bReK9d0u403w[...]` I need the HTML tags. Is that possible? With the second code I can get the text, but I still need the HTML tags :S - yisus, Dec 21 '17 at 15:32
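A possible explanation for the output in this comment (this is an assumption, since the full output isn't shown): by default Mammoth inlines images in the src attribute of img elements as base64 data URIs. A document with images then produces HTML where the tags are buried between very long runs of base64 text, and a terminal may only show the tail of it. The sketch below uses only the standard library to shorten those payloads so the tags become visible; shorten_data_uris and the sample string are hypothetical.

```python
import re

# Matches an inline base64 data URI, e.g. data:image/png;base64,iVBORw0...
DATA_URI = re.compile(r"data:([\w.+/-]+);base64,[A-Za-z0-9+/=]+")


def shorten_data_uris(html):
    """Replace each inline base64 payload with a short placeholder."""
    return DATA_URI.sub(
        lambda m: "data:{0};base64,[...]".format(m.group(1)), html
    )


# Made-up fragment resembling what Mammoth emits for a document with an image.
sample = '<p>Hello</p><p><img src="data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ=="/></p>'
print(shorten_data_uris(sample))
# prints: <p>Hello</p><p><img src="data:image/jpeg;base64,[...]"/></p>
```

If the shortened output shows normal tags, the conversion worked and only the embedded images made it unreadable. The Mammoth README also describes a convert_image argument for handling images differently, for example by writing them to separate files.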

score 1 · Answer 2 · answered Mar 13 '20 at 06:13

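You can use the pypandoc module for this (it requires pandoc to be installed). See the code below:

import pypandoc

# The second argument is the target format; the result is written to outputfile.
output = pypandoc.convert_file('file.docx', 'html', outputfile="file_converted.html")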

Avinash Thombre

score 1 · Answer 3 · answered Aug 19 '20 at 09:36

The issue you're having is probably that mammoth doesn't create complete HTML documents, just HTML fragments, meaning the output is missing the <html> and <body> tags. Some browsers can still render the content since they're lenient enough to do so, but I ran into a similar problem when trying to use the raw output. A simple workaround is to wrap the fragment yourself so that you get a proper HTML file:

import mammoth

with open("test.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value  # The generated HTML
    messages = result.messages  # Any messages, such as warnings

    full_html = (
        '<!DOCTYPE html><html><head><meta charset="utf-8"/></head><body>'
        + html
        + "</body></html>"
    )

    with open("test.html", "w", encoding="utf-8") as f:
        f.write(full_html)

Here test.html is whatever name you want to give the output file.

I'm not taking credit for this, I found it here as well, but can't find the source post.

score 0 · Answer 4 · answered Dec 20 '17 at 12:27

As stated in the documentation:

To convert an existing .docx file to HTML, pass a file-like object to mammoth.convert_to_html. The file should be opened in binary mode. For instance:

import mammoth

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value # The generated HTML
    messages = result.messages # Any messages, such as warnings during conversion

fredtantini

Thanks, I've tested the first code and I get the following response when I print the value of `html`: `bxPdWskbW7yqZoo4h9pYM6yBxX1QFx2pCoPYflXfieIPbtqpT913Vk7OzcZdEk3eO7TbWjvZNTGilsfmRrPwDvfiP8AFjw54d8SeF/HXgW58bReK9d0u403w[...]` I need the HTML tags. Is that possible? – yisus Dec 21 '17 at 15:33
