
I have very large DOCX files that I'm hoping to parse so I can build a database of sorts showing the frequency of a word/string across the documents. From what I gather this is definitely not an easy task. I was just hoping for some direction as to a library that I could use to help me with this.

[Screenshot of an example document]

This is an example of what one may look like. The structure isn't consistent, so that will complicate things as well. Any direction would be appreciated!

micshapicsha
    Are you bound to C++? We're doing similar things in our pipeline but primarily using Python since Python has Spacy, which is very fast and accurate for NLP. – T. Altena Jan 09 '20 at 14:52
  • Seems like a job for regex – Bryan Jan 09 '20 at 14:53
  • https://stackoverflow.com/questions/18555064/read-from-word-document-line-by-line – Longoon12000 Jan 09 '20 at 14:53
  • @T.Altena I guess I am not bound to C#, we have other components for the project in .NET so I considered it primary only because I felt it might be easier down the road. Maybe I will try the Python route! Thanks for the tip! – micshapicsha Jan 09 '20 at 15:00

1 Answer


Python-based solution

If (as per your comment) you're able to do this in Python, look at the following snippets:

The first thing to realise is that .docx files are actually ZIP archives containing a number of XML files. Most text content is stored in word/document.xml. Word does some complicated things with numbered lists, which will require you to also load other XML parts such as styles.xml.
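You can see this for yourself with the standard-library zipfile module. The snippet below builds a minimal docx-like archive in memory (real files contain many more parts; the member contents here are illustrative stubs) and reads a member back, exactly as you would with a real .docx:

```python
import io
import zipfile

# Build a minimal docx-like ZIP in memory (real files contain many more parts)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<w:document/>")
    archive.writestr("word/styles.xml", "<w:styles/>")

# Any .docx can be opened the same way and its XML parts read as bytes
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    xml_bytes = archive.read("word/document.xml")

print(names)
```

Pointing `zipfile.ZipFile` at an actual .docx path works the same way.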

The markup of DOCX files can be a pain: the document is structured into w:p elements (paragraphs) containing arbitrary w:r elements (runs). A run is basically 'a bit of typing', so it can be a single letter or a couple of words together.
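To illustrate how a single word can be fragmented across runs, here is a hand-written WordprocessingML fragment (made up, but structurally like a real document.xml) parsed with the standard library's xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET

ns = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}

# One paragraph whose text Word has split over three runs
xml = """<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r><w:t>He</w:t></w:r>
  <w:r><w:t>llo </w:t></w:r>
  <w:r><w:t>world</w:t></w:r>
</w:p>"""

paragraph = ET.fromstring(xml)
runs = [t.text for t in paragraph.findall(".//w:r/w:t", ns)]
print(runs)           # ['He', 'llo ', 'world'] -- the raw fragments
print("".join(runs))  # 'Hello world' -- the reassembled paragraph text
```

This is why you generally want to join the runs of a paragraph back together before counting words.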

We use UpdateableZipFile from https://stackoverflow.com/a/35435548, primarily because we also wanted to be able to edit the documents; for read-only work you could use just snippets from it.

import os

from lxml import etree

from updateable_zip_file import UpdateableZipFile  # the class from the answer linked above

nsmap = {'w': "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
         'mc': "http://schemas.openxmlformats.org/markup-compatibility/2006",
        } #you might need a few more namespace definitions if you get funky docx inputs

# UpdateableZipFile subclasses zipfile.ZipFile, so read() returns a member's raw bytes
source_file = UpdateableZipFile(os.path.join(path, input_file))
document = etree.fromstring(source_file.read('word/document.xml'))

# Query the XML using XPath (don't use regex); this gives the text of all paragraph nodes:
paragraph_list = document.xpath("//w:p/descendant-or-self::*/text()", namespaces=nsmap)
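Note that this XPath returns every text node in document order, so words split across runs come back in pieces. If you want one string per paragraph instead, iterate over the w:p nodes and join the text inside each. A self-contained sketch, using the stdlib xml.etree.ElementTree and a hand-written fragment standing in for a real document.xml:

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ns = {"w": W}

# Stand-in for a parsed word/document.xml body with two paragraphs
xml = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p><w:r><w:t>First para</w:t></w:r><w:r><w:t>graph</w:t></w:r></w:p>
    <w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>
  </w:body>
</w:document>"""

document = ET.fromstring(xml)

# One joined string per w:p, instead of a flat list of text fragments
paragraph_list = ["".join(p.itertext()) for p in document.findall(".//w:p", ns)]
print(paragraph_list)  # ['First paragraph', 'Second paragraph']
```

With lxml the same idea works via `document.xpath("//w:p", namespaces=nsmap)` and joining each node's text.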

You can then feed the text to NLP such as Spacy:

import spacy

nlp = spacy.load("en_core_web_sm")
word_counts = {}

for paragraph in paragraph_list:
    doc = nlp(paragraph)
    for token in doc:
        word_counts[token.text] = word_counts.get(token.text, 0) + 1

spaCy will tokenize the text for you, and can do a lot more in terms of named-entity recognition, part-of-speech tagging, and so on.
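If downloading a spaCy model isn't an option, or you only need raw frequencies, a simple regex tokenizer plus collections.Counter builds the same word/frequency table (a crude stand-in, not a substitute for proper tokenization; the sample text here is illustrative):

```python
import re
from collections import Counter

paragraph_list = ["Hello world", "hello again, world"]  # stand-in for the extracted text

word_counts = Counter()
for paragraph in paragraph_list:
    # \w+ is a crude tokenizer; lowercase so "Hello" and "hello" count together
    word_counts.update(re.findall(r"\w+", paragraph.lower()))

print(word_counts.most_common(3))  # [('hello', 2), ('world', 2), ('again', 1)]
```

Counter also merges nicely across documents (`total_counts += word_counts`), which fits the "database across many files" goal.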

T. Altena