1

I have requirement where i need to read custom numbering from docx, I am using python-docx 0.8 but that does not fully support numbering. I also tried to read protected property paragraph._p.pPr.numPr but no success. I tried to extract the docx into xml and find relationship between numbering.xml and document.xml. I do not get any success there as well since the numbering.xml has formula and it is apply by style tag i.e. (<w:pPr><w:pStyle w:val="abc"/></w:pPr>) or through numPr (<w:pPr> <w:numPr><w:ilvl w:val="0"/><w:numId w:val="0"/></w:numPr></w:pPr>). I tried to find the relationship with numId with numbering.xml but no luck either.

The numbering in word document is like '[0001] This application'

arvind
  • 106
  • 7
  • Can you put a sample of the file you are trying to read and what you want the output to look like? – BLimitless May 06 '21 at 20:47
  • Sample file : https://easyupload.io/x0fl10 – arvind May 06 '21 at 22:10
  • the output should give me the numbering for each paragraph i.e. [0001], [0002], [0003], [0004] and 1.,2. etc – arvind May 06 '21 at 22:26
  • Okay, I see what you're trying to do. I don't think you can do this off-the-shelf with docx. [This chain](https://stackoverflow.com/questions/52094242/is-there-any-way-to-read-docx-file-include-auto-numbering-using-python-docx) might have what you need. – BLimitless May 07 '21 at 00:17
  • i tired that link but it does not work for my use case. Also i am not looking solution from python-docx library as I know it does not works in this use case. – arvind May 07 '21 at 16:43

1 Answers1

0

Figured it out! The trick is not to try to read the word doc, but to instead convert to a format that python can process more easily:

  1. Open the .docx and click "Save As"... then select .txt
  2. This didn't work on it's own, I had to further select the sub options for UTF-8 encoding. See pic here, and note I'm doing this from a Mac so your save screen may look different.enter image description here

Once that is done, you can then read the new file like a normal txt file, including the custom numbering, and then do whatever natural language processing you want on it to isolate the numbers. Here's my (overly simple) code:

# Create array called 'doc' where each item in array is line from word doc.
with open('TEST_WORD.txt', 'r') as f:
    doc = f.readlines()

# Function to return just the first word from each line if it starts with a number or a bracket.
def return_first_number(string):
    if string[0] in ['1', '2', '3', '4', '5', '6', '7', '8', '9', '[']:
        return string.split()[0]

# Use function to create a list just of lines that start with a number.
cleaned = [return_first_number(item) for item in doc]

# Get rid of all the "None"s. 
cleaned = [item for item in cleaned if item]

# Print final list
print(cleaned)

out: ['[0001]', '1960s', '[0002]', '1960s', '[0003]', '1960s', '[0004]', '1960s', '1.', '2.']

This is by no means complete for what you need. It has some false positives (e.g. a line that starts with 1960s referring to the year), and I only have a sample of the word doc so there's likely other edge cases too. But this should get you on the right path.

Good luck!

BLimitless
  • 2,060
  • 5
  • 17
  • 32
  • docx file is downloaded on linux server using rest api call. The python web application process it. There is no way I can open it manually and save as .txt or .pdf format unless that can be done through python itself. I also try libreoffice to convert into pdf for processing that loose numbering and format of word docx. I cannot use native MS word feature to convert file format. – arvind May 08 '21 at 21:56
  • Try [this](https://stackoverflow.com/questions/62859658/how-to-convert-docx-to-txt-in-python) – BLimitless May 08 '21 at 23:08
  • Thanks a lot but even this library lost formatting and give 1. instead of [0001] – arvind May 08 '21 at 23:55