Context
I am trying to build a Word document browser in Excel to sift trough a large amount of documents (around 1000).
The process of opening a word document proves to be rather slow (around 4 seconds per documents, so in this case it takes 2 hour to look through all the items, which is far too slow for a single query), even by disabling all things that could slow down the opening, hence I open:
- As read only
- Without the open and repair mode (which can happen on some documents)
- Disabling the display of the document
My attempt so far
These documents are tricky to look through because some keywords do appear every single time but not in the same context (not the core of the problem here since I can handle that when the text is loaded in arrays). Hence the often used Windows explorer
solution (like in this link ) cannot be used in my case.
For the moment, I managed to have a working macro that analyze the content of the word documents by opening them.
Code
Here is a sample of the code.
Note that I used the Microsoft Word 14.0 Object Library
reference
' Analyzing all the word document within the same folder '
Sub extractFile()
Dim i As Long, j As Long
Dim sAnalyzedDoc As String, sLibName As String
Dim aOut()
Dim oWordApp As Word.Application
Dim oDoc As Word.Document
Set oWordApp = CreateObject("Word.Application")
sLibName = ThisWorkbook.Path & "\"
sAnalyzedDoc = Dir(sLibName)
sKeyword = "example of a word"
With Application
.DisplayAlerts = False
.ScreenUpdating = False
End With
ReDim aOut(2, 2)
aOut(1, 1) = "Document name"
aOut(2, 1) = "Text"
While (sAnalyzedDoc <> "")
' Analyzing documents only with the .doc and .docx extension '
If Not InStr(sAnalyzedDoc, ".doc") = 0 Then
' Opening the document as mentionned above, in read only mode, without repair and invisible '
Set oDoc = Word.Documents.Open(sLibName & "\" & sAnalyzedDoc, ReadOnly:=True, OpenAndRepair:=False, Visible:=False)
With oDoc
For i = 1 To .Sentences.Count
' Searching for the keyword within the document '
If Not InStr(LCase(.Sentences.Item(i)), LCase(sKeyword)) = 0 Then
If Not IsEmpty(aOut(1, 2)) Then
ReDim Preserve aOut(2, UBound(aOut, 2) + 1)
End If
aOut(1, UBound(aOut, 2)) = sAnalyzedDoc
aOut(2, UBound(aOut, 2)) = .Sentences.Item(i)
GoTo closingDoc ' A dubious programming choice but that works for the moment '
End If
Next i
closingDoc:
' Intending to make the closing faster by not saving the document '
.Close SaveChanges:=False
End With
End If
'Moving on to the next document '
sAnalyzedDoc = Dir
Wend
exitSub:
With Output
.Range(.Cells(1, 1), .Cells(UBound(aOut, 1), UBound(aOut, 2))) = aOut
End With
With Application
.DisplayAlerts = True
.ScreenUpdating = True
End With
End Sub
My question
The idea I thought was to go via the XML content within the document to access directly to its content (which you can access when renaming any document in newer versions of Word, with a .zip
extension and going for nameOfDocument.zip\word\document.xml
).
It would be a lot faster than loading all the images, charts and tables of the word document which are of no use in a text search.
Thus, I wanted to ask if there was a way in VBA to open a word document like a zip file and access that XML document to then process it like a normal string of characters in VBA, since I already have the path and the name of the file given the above code.