Read custom numbering from docx using python

Question

I have requirement where i need to read custom numbering from docx, I am using python-docx 0.8 but that does not fully support numbering. I also tried to read protected property paragraph._p.pPr.numPr but no success. I tried to extract the docx into xml and find relationship between numbering.xml and document.xml. I do not get any success there as well since the numbering.xml has formula and it is apply by style tag i.e. (<w:pPr><w:pStyle w:val="abc"/></w:pPr>) or through numPr (<w:pPr> <w:numPr><w:ilvl w:val="0"/><w:numId w:val="0"/></w:numPr></w:pPr>). I tried to find the relationship with numId with numbering.xml but no luck either.

The numbering in word document is like '[0001] This application'

Can you put a sample of the file you are trying to read and what you want the output to look like? — BLimitless, May 06 '21 at 20:47
the output should give me the numbering for each paragraph i.e. [0001], [0002], [0003], [0004] and 1.,2. etc — arvind, May 06 '21 at 22:26
Okay, I see what you're trying to do. I don't think you can do this off-the-shelf with docx. [This chain](https://stackoverflow.com/questions/52094242/is-there-any-way-to-read-docx-file-include-auto-numbering-using-python-docx) might have what you need. — BLimitless, May 07 '21 at 00:17
i tired that link but it does not work for my use case. Also i am not looking solution from python-docx library as I know it does not works in this use case. — arvind, May 07 '21 at 16:43

BLimitless · Answer 1 · 2021-05-08T05:47:06.740

Figured it out! The trick is not to try to read the word doc, but to instead convert to a format that python can process more easily:

Open the .docx and click "Save As"... then select .txt
This didn't work on it's own, I had to further select the sub options for UTF-8 encoding. See pic here, and note I'm doing this from a Mac so your save screen may look different.

Once that is done, you can then read the new file like a normal txt file, including the custom numbering, and then do whatever natural language processing you want on it to isolate the numbers. Here's my (overly simple) code:

# Create array called 'doc' where each item in array is line from word doc.
with open('TEST_WORD.txt', 'r') as f:
    doc = f.readlines()

# Function to return just the first word from each line if it starts with a number or a bracket.
def return_first_number(string):
    if string[0] in ['1', '2', '3', '4', '5', '6', '7', '8', '9', '[']:
        return string.split()[0]

# Use function to create a list just of lines that start with a number.
cleaned = [return_first_number(item) for item in doc]

# Get rid of all the "None"s. 
cleaned = [item for item in cleaned if item]

# Print final list
print(cleaned)

out: ['[0001]', '1960s', '[0002]', '1960s', '[0003]', '1960s', '[0004]', '1960s', '1.', '2.']

This is by no means complete for what you need. It has some false positives (e.g. a line that starts with 1960s referring to the year), and I only have a sample of the word doc so there's likely other edge cases too. But this should get you on the right path.

Good luck!

docx file is downloaded on linux server using rest api call. The python web application process it. There is no way I can open it manually and save as .txt or .pdf format unless that can be done through python itself. I also try libreoffice to convert into pdf for processing that loose numbering and format of word docx. I cannot use native MS word feature to convert file format. — arvind, May 08 '21 at 21:56
Try [this](https://stackoverflow.com/questions/62859658/how-to-convert-docx-to-txt-in-python) — BLimitless, May 08 '21 at 23:08
Thanks a lot but even this library lost formatting and give 1. instead of [0001] — arvind, May 08 '21 at 23:55

Read custom numbering from docx using python

1 Answers1