Python Mammoth Strange elements within HTML headings

Question

I found the Mammoth Python package a couple of days ago, and it's a great tool that produces really clean HTML from a Word document. It's nearly perfect. There is just one artifact I don't understand: the heading elements (h1-h6) it creates from the Word headings contain several <a> elements with strange TOC ids. It looks like this:

<h1><a id="_Toc48228035"></a><a id="_Toc48288791"></a><a id="_Toc48303673"></a><a id="_Toc48306159"></a><a id="_Toc48308644"></a><a id="_Toc48311128"></a><a id="_Toc48313611"></a>Arteriosklerose</h1>

Does anybody know how to get rid of these?

Thanks in advance

Cheers, Peter

score 0 · Answer 1 · answered Aug 20 '20 at 13:20


This is just a guess, but I hope it helps:

TOC most probably stands for "table of contents". When you want to jump to an element on the page (such as a certain chapter), you give that element an id and append #id to the URL. The browser then scrolls directly to that point.

I guess your document has a table of contents with links in it, and when you inspect them you will find something like <a href="#_Toc48228035">Arteriosklerose</a>
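One way to check this guess against the converted HTML is to list the anchor ids and compare them with the in-page links that point at them. Below is a minimal sketch using only the Python standard library; the names `AnchorCollector` and `unreferenced_anchor_ids` are made up for illustration and are not part of Mammoth.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the ids of <a> elements and the targets of in-page links."""

    def __init__(self):
        super().__init__()
        self.ids = set()      # values of id="..." on <a> elements
        self.targets = set()  # fragment names from href="#..."

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        if attributes.get("id"):
            self.ids.add(attributes["id"])
        href = attributes.get("href") or ""
        if href.startswith("#"):
            self.targets.add(href[1:])


def unreferenced_anchor_ids(html):
    """Return anchor ids that no in-page link in the same HTML points to."""
    collector = AnchorCollector()
    collector.feed(html)
    return collector.ids - collector.targets


# Hypothetical sample: one TOC link plus a heading with two anchors.
sample = (
    '<p><a href="#_Toc48228035">Arteriosklerose</a></p>'
    '<h1><a id="_Toc48228035"></a><a id="_Toc48288791"></a>Arteriosklerose</h1>'
)
print(unreferenced_anchor_ids(sample))
```

Ids that show up in the result are not the target of any link in the document, so removing those anchors should not break in-page navigation.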

answered Aug 20 '20 at 13:20

cagcoach


Thanks @cagcoach for this hint. It is indeed related to the table of contents. To keep these <a> elements out of the HTML headings, you have to make sure that the TOC styles are not used in your Word document. So I copied the styles from a dotm template into my documents but skipped the TOC1-8 styles. It worked. The HTML is clean now. Cheers. – Peter Ebel Aug 20 '20 at 14:15
Update: The anchors in the HTML headings were created because the Word document contained bookmarks. Once the bookmarks were deleted, the anchors disappeared as well. Kudos for this hint go to Michael Williamson, creator of the Python mammoth package. – Peter Ebel Aug 23 '20 at 11:55
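If editing the Word document to delete the bookmarks is not practical, the empty anchors could instead be stripped from the HTML after conversion. This is only a sketch, not something suggested in the thread: it assumes the anchors appear exactly in the form shown in the question (`<a id="_Toc..."></a>`, with no other attributes and no content), and the function name `strip_toc_anchors` is made up for illustration.

```python
import re

# Assumption: bookmark anchors look exactly like <a id="_Toc48228035"></a>,
# as in the question. Anchors with other attributes or content are left alone.
EMPTY_TOC_ANCHOR = re.compile(r'<a id="_Toc[^"]*"></a>')


def strip_toc_anchors(html):
    """Remove empty <a id="_Toc..."></a> anchors from an HTML string."""
    return EMPTY_TOC_ANCHOR.sub("", html)


# The heading from the question, used as sample input.
heading = (
    '<h1><a id="_Toc48228035"></a><a id="_Toc48288791"></a>'
    '<a id="_Toc48303673"></a><a id="_Toc48306159"></a>'
    '<a id="_Toc48308644"></a><a id="_Toc48311128"></a>'
    '<a id="_Toc48313611"></a>Arteriosklerose</h1>'
)
print(strip_toc_anchors(heading))  # <h1>Arteriosklerose</h1>
```

The input would typically be the `value` of the result returned by `mammoth.convert_to_html`. Note that any table-of-contents links pointing at the removed ids would no longer scroll to the heading.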
