Since list numbering styles have not yet been implemented in python-docx, it is necessary to look inside the Word document itself: unpack it and convert its XML into a readable form.
main.py 1st part
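import zipfile
import xml.etree.ElementTree as ET
from pprint import pprint

# WordprocessingML namespace; 'paragraph' is only the local prefix used in the queries
ns = {'paragraph': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
result = dict()
i, j, k = -1, -1, -1  # item counters for list levels 0, 1 and 2
lev = 0

# A .docx file is a zip archive; the document body is stored in word/document.xml
with open('test.docx', 'rb') as f:
    myzip = zipfile.ZipFile(f)
    xml_content = myzip.read('word/document.xml')

root = ET.fromstring(xml_content)
tree = ET.ElementTree(root)
ET.indent(tree, space="\t", level=0)  # pretty-print; needs Python 3.9+
tree.write('data.xml', encoding="utf-8")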
The resulting XML file shows that, when a list is nested, the <ns0:ilvl ns0:val="1" /> tag gives the nesting depth: 0, 1, 2. The item numbers within each level and their style are not directly visible there, so they have to be worked out and assigned manually.
data.xml
..................
<ns0:p ns0:rsidR="368C8D9E" ns0:rsidP="368C8D9E" ns0:rsidRDefault="368C8D9E" ns2:paraId="59FD4FDE" ns2:textId="7D184D8C">
    <ns0:pPr>
        <ns0:pStyle ns0:val="ListParagraph" />
        <ns0:numPr>
            <ns0:ilvl ns0:val="1" />
................
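The nesting depth can also be checked without reading data.xml by eye. The following is only a minimal sketch: it assumes a small inline XML fragment standing in for the real word/document.xml, and the helper name list_levels is hypothetical, not part of python-docx.

```python
import xml.etree.ElementTree as ET

W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
ns = {'w': W}

# Hypothetical stand-in for word/document.xml, reduced to the parts that matter here
SAMPLE = f'''<w:document xmlns:w="{W}"><w:body>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/></w:numPr></w:pPr><w:r><w:t>first</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="1"/></w:numPr></w:pPr><w:r><w:t>nested</w:t></w:r></w:p>
<w:p><w:r><w:t>plain paragraph</w:t></w:r></w:p>
</w:body></w:document>'''


def list_levels(root):
    """Return (ilvl, text) for every paragraph that has an ilvl inside numPr."""
    found = []
    for p in root.findall('.//w:p', ns):
        ilvl = p.find('./w:pPr/w:numPr/w:ilvl', ns)
        if ilvl is None:
            continue  # not a list paragraph
        # Namespaced attributes are keyed as '{namespace}name' in ElementTree
        level = int(ilvl.get(f'{{{W}}}val'))
        # A paragraph may be split into several runs, so join every w:t
        text = ''.join(t.text or '' for t in p.findall('.//w:t', ns))
        found.append((level, text))
    return found


levels = list_levels(ET.fromstring(SAMPLE))
print(levels)
```

Comparing against None (instead of relying on the truth value of the element) keeps the check independent of how many children the element has.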
Therefore, we go over all the paragraphs, pick out the numbered ones, and store them in a dictionary whose keys reflect their nesting.
main.py 2nd part
for paragraph in root.findall('.//paragraph:p', ns):
    # numPr is present only on paragraphs that belong to a list
    if paragraph.find('.//paragraph:numPr', ns) is not None:
        elem_level = paragraph.find('.//paragraph:ilvl', ns)
        level = list(elem_level.attrib.values())[0]  # ilvl carries a single attribute, val
        text = paragraph.find('.//paragraph:t', ns).text  # text of the first run only
        match level:  # match needs Python 3.10+
            case '0':
                i += 1
                lev = str(i)
                j, k = -1, -1  # a new top-level item resets the deeper counters
            case '1':
                j += 1
                lev = f'{i}-{j}'
                k = -1
            case '2':
                k += 1
                lev = f'{i}-{j}-{k}'
            case _:
                print('error')
        result[lev] = text
pprint(result)
-------------------------------
{'0': 'This is the first bullet point.',
 '0-0': 'This is the first sub bullet point.',
 '0-1': 'This is the second sub bullet point.',
 '0-1-0': 'My name is Bob.',
 '0-1-1': 'My name is Dave.',
 '1': 'This is the second bullet point.'}
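The match statement above handles exactly three levels. For deeper lists, the three counters could be replaced by a single list of counters. This is only a sketch under stated assumptions: the (level, text) pairs have already been extracted, a level is never skipped (for example 0 followed directly by 2), and nest_keys is a hypothetical helper name.

```python
def nest_keys(items):
    """Map (level, text) pairs to keys such as '0', '0-1', '0-1-0'."""
    counters = []  # counters[d] is the current zero-based index at depth d
    result = {}
    for level, text in items:
        # Open a new counter the first time a deeper level is entered
        while len(counters) <= level:
            counters.append(-1)
        # Returning to a shallower level discards the deeper counters
        del counters[level + 1:]
        counters[level] += 1
        result['-'.join(str(c) for c in counters)] = text
    return result


# Same items as in the output shown above
items = [
    (0, 'This is the first bullet point.'),
    (1, 'This is the first sub bullet point.'),
    (1, 'This is the second sub bullet point.'),
    (2, 'My name is Bob.'),
    (2, 'My name is Dave.'),
    (0, 'This is the second bullet point.'),
]
keys = list(nest_keys(items))
print(keys)
```

With the sample items this produces the same keys as the three-counter version, and it keeps working if a fourth or fifth level appears.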