Read the content of a Word document via its XML

Question

Context

I am trying to build a Word document browser in Excel to sift trough a large amount of documents (around 1000).

The process of opening a word document proves to be rather slow (around 4 seconds per documents, so in this case it takes 2 hour to look through all the items, which is far too slow for a single query), even by disabling all things that could slow down the opening, hence I open:

As read only
Without the open and repair mode (which can happen on some documents)
Disabling the display of the document

My attempt so far

These documents are tricky to look through because some keywords do appear every single time but not in the same context (not the core of the problem here since I can handle that when the text is loaded in arrays). Hence the often used Windows explorer solution (like in this link ) cannot be used in my case.

For the moment, I managed to have a working macro that analyze the content of the word documents by opening them.

Code

Here is a sample of the code. Note that I used the Microsoft Word 14.0 Object Library reference

' Analyzing all the word document within the same folder '
Sub extractFile()

Dim i As Long, j As Long
Dim sAnalyzedDoc As String, sLibName As String
Dim aOut()
Dim oWordApp As Word.Application
Dim oDoc As Word.Document

Set oWordApp = CreateObject("Word.Application")

sLibName = ThisWorkbook.Path & "\"
sAnalyzedDoc = Dir(sLibName)
sKeyword = "example of a word"

With Application
    .DisplayAlerts = False
    .ScreenUpdating = False
End With

ReDim aOut(2, 2)
aOut(1, 1) = "Document name"
aOut(2, 1) = "Text"


While (sAnalyzedDoc <> "")
    ' Analyzing documents only with the .doc and .docx extension '
    If Not InStr(sAnalyzedDoc, ".doc") = 0 Then
        ' Opening the document as mentionned above, in read only mode, without repair and invisible '
        Set oDoc = Word.Documents.Open(sLibName & "\" & sAnalyzedDoc, ReadOnly:=True, OpenAndRepair:=False, Visible:=False)
        With oDoc
            For i = 1 To .Sentences.Count
                ' Searching for the keyword within the document '
                If Not InStr(LCase(.Sentences.Item(i)), LCase(sKeyword)) = 0 Then
                    If Not IsEmpty(aOut(1, 2)) Then
                        ReDim Preserve aOut(2, UBound(aOut, 2) + 1)
                    End If
                    aOut(1, UBound(aOut, 2)) = sAnalyzedDoc
                    aOut(2, UBound(aOut, 2)) = .Sentences.Item(i)
                    GoTo closingDoc ' A dubious programming choice but that works for the moment '
                End If
            Next i
closingDoc:
            ' Intending to make the closing faster by not saving the document '
            .Close SaveChanges:=False
        End With
    End If
    'Moving on to the next document '
    sAnalyzedDoc = Dir
Wend

exitSub:
With Output
    .Range(.Cells(1, 1), .Cells(UBound(aOut, 1), UBound(aOut, 2))) = aOut
End With

With Application
    .DisplayAlerts = True
    .ScreenUpdating = True
End With

End Sub

My question

The idea I thought was to go via the XML content within the document to access directly to its content (which you can access when renaming any document in newer versions of Word, with a .zip extension and going for nameOfDocument.zip\word\document.xml).

It would be a lot faster than loading all the images, charts and tables of the word document which are of no use in a text search.

Thus, I wanted to ask if there was a way in VBA to open a word document like a zip file and access that XML document to then process it like a normal string of characters in VBA, since I already have the path and the name of the file given the above code.

You can access zipped files directly via a Shell object (http://www.rondebruin.nl/win/s7/win002.htm), but then you're going to be stuck parsing XML (http://stackoverflow.com/questions/11305/how-to-parse-xml-using-vba) and Word has a horrible underlying xml to work with. Good luck. — Mikegrann, Aug 24 '16 at 14:33
Take a look at [VBA macro to search a folder for a keyword](http://stackoverflow.com/a/20856265/4717755). By using the `FindFiles` function as described (use the second version), you will leverage the Windows index of all the words in your documents. — PeterT, Aug 24 '16 at 14:36
Thanks to you both, I'll have a look at the links and try to make something. — Pierre Chevallier, Aug 24 '16 at 15:03
Ok so far I've concluded that what I wanted to do (which is editing the .docx without changing the extension) cannot be done in VBA. I am currently writing a DLL in C# that might solve the matter similar to the [code found on the MSDN](https://msdn.microsoft.com/en-us/library/office/ff478255.aspx) I hope to post something shortly about it. — Pierre Chevallier, Aug 26 '16 at 13:03

score 2 · Accepted Answer · edited Oct 02 '19 at 10:17

Do note that this is not an easy answer to the above problem and the sole VBA code in my initial question will do perfectly the job as long as you do not have a load of documents to browse, else go for another tool (there is a Python Dynamic Link Library (DLL) that does that very well).

Ok, I'll try to make my answer as explanatory as possible.

First of all this question lead me to the infinite journey of XML in C# and in XPath which I chose not to pursue at some point.

It reduced the time of analyzing the files from roughly 2 hours to 10 seconds.

Context

The backbone of reading XML documents, and therefore inner word XML documents, is the OpenXML library from Microsoft. Keep in mind what I said above, that the method I was trying to implement cannot be done solely in VBA and thus must be done in another way. This is probably due to the fact that VBA is implemented for Office and thus limited in accessing the core structure of Office documents, but I have no information relating to this limitation (any information is welcomed).

The answer I will give here is writing a C# DLL for VBA. For writing DLL in C# and referencing to it in VBA I redirect you toward the following link which will resume in a better way this specific process: Tutorial for creating DLL in C#

Let's start

First of all you will need to reference the WindowsBase library and the DocumentFormat.OpenXML in your project to make the solution work as explained in this MSDN article Manipulate Office Open XML Formats Documents and that one Open and add text to a word processing document (Open XML SDK) These articles explain broadly how works the OpenXML library for manipulating word documents.

The C# code

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;
using System.IO.Packaging;

namespace BrowserClass
{

    public class SpecificDirectory
    {

        public string[,] LookUpWord(string nameKeyword, string nameStopword, string nameDirectory)
        {
            string sKeyWord = nameKeyword;
            string sStopWord = nameStopword;
            string sDirectory = nameDirectory;

            sStopWord = sStopWord.ToLower();
            sKeyWord = sKeyWord.ToLower();

            string sDocPath = Path.GetDirectoryName(sDirectory);
            // Looking for all the documents with the .docx extension
            string[] sDocName = Directory.GetFiles(sDocPath, "*.docx", SearchOption.AllDirectories);
            string[] sDocumentList = new string[1];
            string[] sDocumentText = new string[1];

            // Cycling the documents retrieved in the folder
            for (int i = 0; i < sDocName.Count(); i++)
            {
                string docWord = sDocName[i];

                // Opening the documents as read only, no need to edit them
                Package officePackage = Package.Open(docWord, FileMode.Open, FileAccess.Read);

                const String officeDocRelType = @"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

                PackagePart corePart = null;
                Uri documentUri = null;

                // We are extracting the part with the document content within the files
                foreach (PackageRelationship relationship in officePackage.GetRelationshipsByType(officeDocRelType))
                {
                    documentUri = PackUriHelper.ResolvePartUri(new Uri("/", UriKind.Relative), relationship.TargetUri);
                    corePart = officePackage.GetPart(documentUri);
                    break;
                }

                // Here enter the proper code
                if (corePart != null)
                {
                    string cpPropertiesSchema = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
                    string dcPropertiesSchema = "http://purl.org/dc/elements/1.1/";
                    string dcTermsPropertiesSchema = "http://purl.org/dc/terms/";

                    // Construction of a namespace manager to handle the different parts of the xml files
                    NameTable nt = new NameTable();
                    XmlNamespaceManager nsmgr = new XmlNamespaceManager(nt);
                    nsmgr.AddNamespace("dc", dcPropertiesSchema);
                    nsmgr.AddNamespace("cp", cpPropertiesSchema);
                    nsmgr.AddNamespace("dcterms", dcTermsPropertiesSchema);

                    // Loading the xml document's text
                    XmlDocument doc = new XmlDocument(nt);
                    doc.Load(corePart.GetStream());

                    // I chose to directly load the inner text because I could not parse the way I wanted the document, but it works so far
                    string docInnerText = doc.DocumentElement.InnerText;
                    docInnerText = docInnerText.Replace("\\* MERGEFORMAT", ".");
                    docInnerText = docInnerText.Replace("DOCPROPERTY ", "");
                    docInnerText = docInnerText.Replace("Glossary.", "");

                    try
                    {
                        Int32 iPosKeyword = docInnerText.ToLower().IndexOf(sKeyWord);
                        Int32 iPosStopWord = docInnerText.ToLower().IndexOf(sStopWord);

                        if (iPosStopWord == -1)
                        {
                            iPosStopWord = docInnerText.Length;
                        }

                        if (iPosKeyword != -1 && iPosKeyword <= iPosStopWord)
                        {
                            // Redimensions the array if there was already a document loaded
                            if (sDocumentList[0] != null)
                            {
                                Array.Resize(ref sDocumentList, sDocumentList.Length + 1);
                                Array.Resize(ref sDocumentText, sDocumentText.Length + 1);
                            }
                            sDocumentList[sDocumentList.Length - 1] = docWord.Substring(sDocPath.Length, docWord.Length - sDocPath.Length);
                            // Taking the small context around the keyword
                            sDocumentText[sDocumentText.Length - 1] = ("(...) " + docInnerText.Substring(iPosKeyword, sKeyWord.Length + 60) + " (...)");
                        }

                    }
                    catch (ArgumentOutOfRangeException)
                    {
                        Console.WriteLine("Error reading inner text.");
                    }
                }
                // Closing the package to enable opening a document right after
                officePackage.Close();
            }

            if (sDocumentList[0] != null)
            {
                // Preparing the array for output
                string[,] sFinalArray = new string[sDocumentList.Length, 2];

                for (int i = 0; i < sDocumentList.Length; i++)
                {
                    sFinalArray[i, 0] = sDocumentList[i].Replace("\\", "");
                    sFinalArray[i, 1] = sDocumentText[i];
                }
                return sFinalArray;
            }
            else 
            {
                // Preparing the array for output
                string[,] sFinalArray = new string[1, 1];
                sFinalArray[0, 0] = "NO MATCH";
                return sFinalArray;
            }
        }
    }

}

The VBA code associated

Option Explicit

Const sLibname As String = "C:\pathToYourDocuments\"

Sub tester()

Dim aFiles As Variant
Dim LookUpDir As BrowserClass.SpecificDirectory
Set LookUpDir = New BrowserClass.SpecificDirectory

' The array will contain all the files which contain the "searchedPhrase" '
aFiles = LookUpDir.LookUpWord("searchedPhrase", "stopWord", sLibname)

' Add here any necessary processing if needed '

End Sub

So in the end you get a tool that can scan .docx documents much faster than in a classic open-read-close approach in VBA at the cost of more code writing.

Above all you get a simple solution for your users that just want to perform simple search, especially when there is a huge number of word documents.

Note

Parsing Word .XML files can be nightmarish in VBA as pointed out by @Mikegrann . Thankfully OpenXML has an XML parser C# , xml parsing. get data between tags that will do the work for you in C# and take the <w:t></w:t> tags that are refering to the text of the docment. Though I found these answers so far but couldn't make them work: Parsing a MS Word generated XML file in C# , Reading specific XML elements from XML file

So I went for the .InnerText solution I provided with my code above, to access the internal text, at the cost of having some formatting text input (like \\MERGEFORMAT).

Yes indeed I forgot to add that. Now the query takes approximately 10 seconds to execute for all files, but it is restricted to anlayzing `.doc` files — Pierre Chevallier, Sep 05 '16 at 13:44
Great! Consider posting a C# question and have the gurus show how you can integrate *.docx* files. — Parfait, Sep 05 '16 at 22:19
Damn it, I just came back to see that I mentioned `.doc` when in fact this answer applies only to `.docx` files, because `.doc` files have a different way of storing the data. Apologies, — Pierre Chevallier, Nov 11 '16 at 10:53