Python-based solution
If (as per your comment) you're able to do this in Python, look at the following snippets:
So the first thing to realise is that docx files are actually ZIP archives containing a number of XML files. Most text content is stored in `word/document.xml`. Word does some complicated things with numbered lists, which will require you to also load other XML parts like `styles.xml`.
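To see this for yourself, here is a minimal sketch using only the standard library's `zipfile` module; the archive is built in memory purely for illustration, but a real file saved by Word is the same kind of ZIP, just with more members:

```python
import io
import zipfile

# Build a toy "docx" in memory: a real document is the same kind of
# ZIP archive, just with more members (styles.xml, numbering.xml, ...).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml", "<w:document/>")
    z.writestr("word/styles.xml", "<w:styles/>")

# Reopen and list the archive members, exactly as you would for a real .docx
with zipfile.ZipFile(buf) as z:
    names = z.namelist()
print(names)  # ['word/document.xml', 'word/styles.xml']
```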
The markup of DOCX files can be a pain, as the document is structured in `w:p` (paragraph) and arbitrary `w:r` (run) elements. A run is basically 'a bit of typing', so it can be a single letter or a couple of words together.
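To illustrate the run problem, here is a small hand-written WordprocessingML fragment (not taken from a real document) in which a single paragraph's text is split across three runs; joining the `w:t` text nodes recovers the full text:

```python
from lxml import etree

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written fragment: one paragraph whose text Word has split over
# three runs (this happens e.g. when part of the text is formatted).
xml = f'''<w:p xmlns:w="{W}">
  <w:r><w:t>Hel</w:t></w:r>
  <w:r><w:t>lo </w:t></w:r>
  <w:r><w:t>world</w:t></w:r>
</w:p>'''

p = etree.fromstring(xml)
# Join the w:t text nodes to reassemble the paragraph's text
text = "".join(p.xpath(".//w:t/text()", namespaces={"w": W}))
print(text)  # Hello world
```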
We use `UpdateableZipFile` from https://stackoverflow.com/a/35435548, primarily because we also wanted to be able to edit the documents; for read-only use you could just lift the relevant snippets from it.
```python
import os

from lxml import etree

from UpdateableZipFile import UpdateableZipFile  # https://stackoverflow.com/a/35435548

path = "."                # directory containing the document
input_file = "input.docx"

source_file = UpdateableZipFile(os.path.join(path, input_file))
nsmap = {'w': "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
         'mc': "http://schemas.openxmlformats.org/markup-compatibility/2006",
         }  # you might need a few more namespace definitions if you get funky docx inputs

# UpdateableZipFile subclasses ZipFile, so read() gives us the raw bytes
# of document.xml, which lxml then parses into an element tree.
document = etree.fromstring(source_file.read('word/document.xml'))

# Query the XML using XPath (don't use regex); this gives the text of all paragraph nodes:
paragraph_list = document.xpath("//w:p/descendant-or-self::*/text()", namespaces=nsmap)
```
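Note that the XPath above returns every individual text node, so one paragraph can be spread over several list entries. If you want one string per paragraph instead, a sketch along these lines joins the `w:t` nodes per `w:p` (a tiny inline tree stands in for the parsed document.xml here):

```python
from lxml import etree

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
nsmap = {"w": W}

# Minimal stand-in for the parsed document.xml tree; in practice this
# comes from the UpdateableZipFile snippet shown earlier.
document = etree.fromstring(f'''<w:document xmlns:w="{W}"><w:body>
  <w:p><w:r><w:t>First para</w:t></w:r></w:p>
  <w:p><w:r><w:t>Second </w:t></w:r><w:r><w:t>para</w:t></w:r></w:p>
</w:body></w:document>''')

# One entry per w:p, with that paragraph's runs joined back together
paragraph_list = ["".join(p.xpath(".//w:t/text()", namespaces=nsmap))
                  for p in document.xpath("//w:p", namespaces=nsmap)]
print(paragraph_list)  # ['First para', 'Second para']
```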
You can then feed the text to an NLP library such as spaCy:
```python
import spacy

nlp = spacy.load("en_core_web_sm")

word_counts = {}
for paragraph in paragraph_list:
    doc = nlp(paragraph)
    for token in doc:
        if token.text in word_counts:
            word_counts[token.text] += 1
        else:
            word_counts[token.text] = 1
```
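As a side note, the manual counting dict can be written more idiomatically with `collections.Counter`; a toy token list stands in for spaCy's output here:

```python
from collections import Counter

# Toy stand-in for the token texts spaCy would produce
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Counter does the same bookkeeping as the if/else dict above
counts = Counter(tokens)
print(counts["the"])  # 2
```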
spaCy will tokenize the text for you and can do a lot more, such as named-entity recognition and part-of-speech tagging.