Python get line position (top, left, bottom, right) from DOCX to HTML

Question

I have a document like this document, I want to convert the DOCX file into HTML file by using Python package docx: docx.convert_to_html(input).

I am wondering is there any method or Python package that I can extract the position from each line? The position represents the boundary of each line in relative location in a page.

I want my ultimate HTML result looks like this:

<line b="38" id="line_0" l="616" r="795" style="white-space:nowrap; position:absolute; left:616px;top:14px; font-size:19.0px; " t="14">DOC
# 2019-0046594<br></line>

Regardless of the format, I need the four position values of "l, r, b, t".

score 0 · Answer 1 · answered Mar 18 '21 at 23:44

You cannot extract positional line rendering information from a DOCX file because DOCX files do not represent such information explicitly. Word wrapping and line rendering algorithms are implementation-dependent.

The same is true of DOCX page numbering in general: See How to access OpenXML content by page number?

Python get line position (top, left, bottom, right) from DOCX to HTML

1 Answers1