
I have very large DOCX files that I'm hoping to parse so I can build a database of sorts showing the frequency of a word/string across the documents. From what I gather this is definitely not an easy task. I was just hoping for some direction as to a library that I could use to help me with this.

[Screenshot of an example document]

This is an example of what one may look like. The structure isn't consistent, so that will complicate things as well. Any direction would be appreciated!

micshapicsha
    Are you bound to C++? We're doing similar things in our pipeline but primarily using Python since Python has Spacy, which is very fast and accurate for NLP. – T. Altena Jan 09 '20 at 14:52
  • Seems like a job for regex – Bryan Jan 09 '20 at 14:53
  • https://stackoverflow.com/questions/18555064/read-from-word-document-line-by-line – Longoon12000 Jan 09 '20 at 14:53
  • @T.Altena I guess I am not bound to C#, we have other components for the project in .NET so I considered it primary only because I felt it might be easier down the road. Maybe I will try the Python route! Thanks for the tip! – micshapicsha Jan 09 '20 at 15:00

1 Answer


Python-based solution

If (as per your comment) you're able to do this in Python, look at the following snippets:

The first thing to realise is that .docx files are actually ZIP archives containing a number of XML files. Most text content is stored in word/document.xml. Word does some complicated things with numbered lists, which will require you to also load other XML parts such as styles.xml.
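You can see this for yourself with the standard-library zipfile module. The snippet below builds a minimal docx-like archive in memory (real files contain many more parts; the member contents here are illustrative stubs) and reads a member back, exactly as you would with a real .docx:

```python
import io
import zipfile

# Build a minimal docx-like ZIP in memory (real files contain many more parts)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<w:document/>")
    archive.writestr("word/styles.xml", "<w:styles/>")

# Any .docx can be opened the same way and its XML parts read as bytes
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    xml_bytes = archive.read("word/document.xml")

print(names)
```

Pointing `zipfile.ZipFile` at an actual .docx path works the same way.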

The markup of DOCX files can be a pain: the document is structured into w:p elements (paragraphs) containing arbitrary w:r elements (runs). A run is basically 'a bit of typing', so it can be a single letter or a couple of words together.
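To illustrate how a single word can be fragmented across runs, here is a hand-written WordprocessingML fragment (made up, but structurally like a real document.xml) parsed with the standard library's xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET

ns = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}

# One paragraph whose text Word has split over three runs
xml = """<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r><w:t>He</w:t></w:r>
  <w:r><w:t>llo </w:t></w:r>
  <w:r><w:t>world</w:t></w:r>
</w:p>"""

paragraph = ET.fromstring(xml)
runs = [t.text for t in paragraph.findall(".//w:r/w:t", ns)]
print(runs)           # ['He', 'llo ', 'world'] -- the raw fragments
print("".join(runs))  # 'Hello world' -- the reassembled paragraph text
```

This is why you generally want to join the runs of a paragraph back together before counting words.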

We use UpdateableZipFile from https://stackoverflow.com/a/35435548, primarily because we also wanted to be able to edit the documents; for read-only work you could use just snippets from it.

import os

from lxml import etree

from updateable_zip_file import UpdateableZipFile  # the class from the answer linked above

nsmap = {'w': "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
         'mc': "http://schemas.openxmlformats.org/markup-compatibility/2006",
        } #you might need a few more namespace definitions if you get funky docx inputs

# UpdateableZipFile subclasses zipfile.ZipFile, so read() returns a member's raw bytes
source_file = UpdateableZipFile(os.path.join(path, input_file))
document = etree.fromstring(source_file.read('word/document.xml'))

# Query the XML using XPath (don't use regex); this gives the text of all paragraph nodes:
paragraph_list = document.xpath("//w:p/descendant-or-self::*/text()", namespaces=nsmap)
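Note that this XPath returns every text node in document order, so words split across runs come back in pieces. If you want one string per paragraph instead, iterate over the w:p nodes and join the text inside each. A self-contained sketch, using the stdlib xml.etree.ElementTree and a hand-written fragment standing in for a real document.xml:

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ns = {"w": W}

# Stand-in for a parsed word/document.xml body with two paragraphs
xml = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p><w:r><w:t>First para</w:t></w:r><w:r><w:t>graph</w:t></w:r></w:p>
    <w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>
  </w:body>
</w:document>"""

document = ET.fromstring(xml)

# One joined string per w:p, instead of a flat list of text fragments
paragraph_list = ["".join(p.itertext()) for p in document.findall(".//w:p", ns)]
print(paragraph_list)  # ['First paragraph', 'Second paragraph']
```

With lxml the same idea works via `document.xpath("//w:p", namespaces=nsmap)` and joining each node's text.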

You can then feed the text to NLP such as Spacy:

import spacy

nlp = spacy.load("en_core_web_sm")
word_counts = {}

for paragraph in paragraph_list:
    doc = nlp(paragraph)
    for token in doc:
        word_counts[token.text] = word_counts.get(token.text, 0) + 1

spaCy will tokenize the text for you, and can do a lot more in terms of named-entity recognition, part-of-speech tagging, and so on.
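If downloading a spaCy model isn't an option, or you only need raw frequencies, a simple regex tokenizer plus collections.Counter builds the same word/frequency table (a crude stand-in, not a substitute for proper tokenization; the sample text here is illustrative):

```python
import re
from collections import Counter

paragraph_list = ["Hello world", "hello again, world"]  # stand-in for the extracted text

word_counts = Counter()
for paragraph in paragraph_list:
    # \w+ is a crude tokenizer; lowercase so "Hello" and "hello" count together
    word_counts.update(re.findall(r"\w+", paragraph.lower()))

print(word_counts.most_common(3))  # [('hello', 2), ('world', 2), ('again', 1)]
```

Counter also merges nicely across documents (`total_counts += word_counts`), which fits the "database across many files" goal.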

T. Altena