I need to read .docx files in Python and retrieve the relative position of each item in a multilevel list. See the example below:

[Image: sample Word document containing a multilevel list]

I want to read only the text within the multilevel list, work out each item's relative position, and return a dictionary. The expected output looks like this:

output = {'1': 'This is the first bullet point.', 
          '1-(a)': 'This is the first sub bullet point.', 
          '1-(b)': 'This is the second sub bullet point.', 
          '1-(b)-(i)': 'My name is Bob.', 
          '1-(b)-(ii)': 'My name is Dave.', 
          '2': 'This is the second bullet point.'
         }

As 'This is a sample document.' and 'End of document.' are not within the multilevel list, these texts shouldn't be included in the dictionary.

I saw some related questions, such as this and this, but they differ from my requirement. I would appreciate any help!

crx91

1 Answer

python-docx does not yet implement list numbering styles, so it is necessary to look inside the Word document itself. A .docx file is a zip archive: unpack it, read word/document.xml, and write it out in a readable, indented form.

main.py 1st part
import zipfile
import xml.etree.ElementTree as ET
from pprint import pprint

ns = {'paragraph': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
result = dict()
i, j, k = -1, -1, -1
lev = 0

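# A .docx file is a zip archive; the body text lives in word/document.xml.
with zipfile.ZipFile('test.docx') as myzip:
    xml_content = myzip.read('word/document.xml')

root = ET.fromstring(xml_content)
tree = ET.ElementTree(root)
# Indent the tree in place and dump it to data.xml for inspection
# (ET.indent requires Python 3.9+).
ET.indent(tree, space="\t", level=0)
tree.write('data.xml', encoding="utf-8")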

The resulting XML file shows that, for a nested list item, the <ns0:ilvl ns0:val="1" /> tag gives the nesting depth: 0, 1, 2. The item number within each level and the numbering style are not directly visible in this file, so they have to be tracked and set manually.

data.xml
..................
<ns0:p ns0:rsidR="368C8D9E" ns0:rsidP="368C8D9E" ns0:rsidRDefault="368C8D9E" ns2:paraId="59FD4FDE" ns2:textId="7D184D8C">
            <ns0:pPr>
                <ns0:pStyle ns0:val="ListParagraph" />
                <ns0:numPr>
                    <ns0:ilvl ns0:val="1" />
................
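
The numbering style of each level is not in document.xml, but it can usually be looked up in a second part of the archive, word/numbering.xml, where each w:num points to a w:abstractNum whose w:lvl children carry w:numFmt and w:lvlText. A minimal sketch of reading it, assuming that layout; the function name level_formats is made up for this example, and w:lvlOverride elements are ignored:

```python
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
NS = {'w': W_NS}
W = '{' + W_NS + '}'  # Clark-notation prefix for namespaced attributes


def level_formats(numbering_xml: bytes) -> dict:
    """Map (numId, ilvl) to (numFmt, lvlText). Hypothetical helper."""
    root = ET.fromstring(numbering_xml)

    # Collect the level definitions of every abstract numbering.
    abstract = {}
    for abstract_num in root.findall('w:abstractNum', NS):
        levels = {}
        for lvl in abstract_num.findall('w:lvl', NS):
            fmt = lvl.find('w:numFmt', NS)
            txt = lvl.find('w:lvlText', NS)
            levels[lvl.get(W + 'ilvl')] = (
                fmt.get(W + 'val') if fmt is not None else None,
                txt.get(W + 'val') if txt is not None else None,
            )
        abstract[abstract_num.get(W + 'abstractNumId')] = levels

    # Resolve each concrete numbering (the numId referenced from
    # document.xml) to its abstract definition.
    formats = {}
    for num in root.findall('w:num', NS):
        ref = num.find('w:abstractNumId', NS)
        if ref is None:
            continue
        for ilvl, pair in abstract.get(ref.get(W + 'val'), {}).items():
            formats[(num.get(W + 'numId'), ilvl)] = pair
    return formats
```

The input can be read the same way as document.xml, for example with zipfile.ZipFile('test.docx').read('word/numbering.xml'). A document without any lists may not contain that part at all, so the read can raise KeyError.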

Therefore, we go over all the paragraphs, select the ones that belong to a numbered list, and store them in a dictionary keyed by their nesting position.

main.py 2nd part
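VAL = '{' + ns['paragraph'] + '}val'  # namespaced name of the w:val attribute
for paragraph in root.findall('.//paragraph:p', ns):
    # Compare with None explicitly: testing an Element's truth value is
    # deprecated and depends on whether it has children.
    if paragraph.find('.//paragraph:numPr', ns) is None:
        continue
    elem_level = paragraph.find('.//paragraph:ilvl', ns)
    # w:ilvl is optional; a missing one is treated as the top level here.
    level = elem_level.get(VAL) if elem_level is not None else '0'
    # A paragraph may be split into several runs, so join every w:t node.
    text = ''.join(t.text or '' for t in paragraph.findall('.//paragraph:t', ns))
    match level:  # match requires Python 3.10+
        case '0':
            i += 1
            lev = str(i)
            j, k = -1, -1  # a new top-level item restarts the sub-levels
        case '1':
            j += 1
            lev = f'{i}-{j}'
            k = -1
        case '2':
            k += 1
            lev = f'{i}-{j}-{k}'
        case _:
            print(f'unsupported nesting level: {level}')
            continue  # do not overwrite the previous entry
    result[lev] = text
pprint(result)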

-------------------------------

{'0': 'This is the first bullet point.',
 '0-0': 'This is the first sub bullet point.',
 '0-1': 'This is the second sub bullet point.',
 '0-1-0': 'My name is Bob.',
 '0-1-1': 'My name is Dave.',
 '1': 'This is the second bullet point.'}
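
The keys above are zero-based counters, whereas the question asks for labels such as '1-(b)-(i)'. A minimal sketch of producing those labels directly, for any nesting depth, follows. It assumes the numbering scheme from the question (decimal, then lowercase letters in parentheses, then lowercase Roman numerals in parentheses) rather than reading the real style from the document. The names list_items, to_letters, to_roman and FORMATTERS are made up for this example. As a simplification, w:numId is ignored, so separate lists in one document would share counters.

```python
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
NS = {'w': W_NS}


def to_letters(n: int) -> str:
    # 1 -> 'a', 26 -> 'z'; past 'z' the letter is repeated (27 -> 'aa'),
    # which is an assumption about the desired labels.
    return chr(ord('a') + (n - 1) % 26) * ((n - 1) // 26 + 1)


def to_roman(n: int) -> str:
    pairs = [(1000, 'm'), (900, 'cm'), (500, 'd'), (400, 'cd'), (100, 'c'),
             (90, 'xc'), (50, 'l'), (40, 'xl'), (10, 'x'), (9, 'ix'),
             (5, 'v'), (4, 'iv'), (1, 'i')]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return ''.join(out)


# One formatter per nesting level (assumed scheme); deeper levels use str.
FORMATTERS = [str,
              lambda n: f'({to_letters(n)})',
              lambda n: f'({to_roman(n)})']


def list_items(document_xml: bytes) -> dict:
    """Return {label: text} for every paragraph that carries w:numPr."""
    root = ET.fromstring(document_xml)
    counters = []  # counters[d] is the current item number at depth d
    result = {}
    for p in root.iter(f'{{{W_NS}}}p'):
        num_pr = p.find('w:pPr/w:numPr', NS)
        if num_pr is None:
            continue  # ordinary paragraph, not part of a list
        ilvl = num_pr.find('w:ilvl', NS)
        level = int(ilvl.get(f'{{{W_NS}}}val')) if ilvl is not None else 0
        del counters[level + 1:]       # leaving deeper levels resets them
        while len(counters) < level:   # a skipped level counts as item 1
            counters.append(1)
        if len(counters) == level:
            counters.append(0)
        counters[level] += 1
        label = '-'.join(
            (FORMATTERS[d] if d < len(FORMATTERS) else str)(n)
            for d, n in enumerate(counters))
        result[label] = ''.join(t.text or '' for t in p.iter(f'{{{W_NS}}}t'))
    return result
```

Called with the xml_content bytes read in the first part, this should give keys in the '1', '1-(a)', '1-(b)-(i)' form requested in the question, provided the document really uses that scheme.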
Сергей Кох