0

I need to write a MS Word macro to count occurence of every word within a given document and print out the list like . I did the macro and it works, but it is so sloooow, it takes several hours to get results for a document of 60000 words. Could you please give me some advices/suggestions on how to make the macro run faster?

(I checked a similar question here WORD VBA Count Word Occurrences but still don't get it how to speed up and need my macro to be reviewed). Thank you.

Private Type WordStatData
    WordText As String
    WordCount As Integer
End Type

Option Base 1
'Check if the word is valid

Private Function IsValidWord(SomeString As String) As Boolean
    Dim Retval As Boolean
    Retval = True
    If Not (InStr(SomeString, " ") = 0) Then Retval = False
    If Not (InStr(SomeString, ".") = 0) Then Retval = False
    If Not (InStr(SomeString, ",") = 0) Then Retval = False
    If Not InStr(SomeString, "0") = 0 Then Retval = False
    If Not InStr(SomeString, "1") = 0 Then Retval = False
    If Not InStr(SomeString, "2") = 0 Then Retval = False
    If Not InStr(SomeString, "3") = 0 Then Retval = False
    If Not InStr(SomeString, "4") = 0 Then Retval = False
    If Not InStr(SomeString, "5") = 0 Then Retval = False
    If Not InStr(SomeString, "6") = 0 Then Retval = False
    If Not InStr(SomeString, "7") = 0 Then Retval = False
    If Not InStr(SomeString, "8") = 0 Then Retval = False
    If Not InStr(SomeString, "9") = 0 Then Retval = False
    IsValidWord = Retval
End Function

Private Sub CommandButton1_Click()
    SpanishLCID = 3082 'The source text is in Spanish
    ListBox1.Clear
    Dim WordsTotal As Long
    WordsTotal = ActiveDocument.Words.Count
    TextBox1.Text = Str(WordsTotal)
    Dim Wordfound As Boolean
    Dim NewWord As String
    Dim MyData() As WordStatData
    ReDim Preserve MyData(1)
    NewWord = ""
    For i = 1 To WordsTotal
        NewWord = Trim(StrConv(Trim(ActiveDocument.Words(i)), vbLowerCase, SpanishLCID))
        'Check if the word is in the list
        If IsValidWord(NewWord) Then
            Wordfound = False
            For j = 1 To UBound(MyData)
                If StrComp(MyData(j).WordText, NewWord) = 0 Then
                    Wordfound = True: Exit For
                End If
            Next j
            If Wordfound Then
                MyData(j).WordCount = MyData(j).WordCount + 1
            Else
                ReDim Preserve MyData(UBound(MyData) + 1)
                MyData(UBound(MyData)).WordText = NewWord
                MyData(UBound(MyData)).WordCount = 1
            End If
        End If
    Next i
    'Printing out the word list
    For i = 1 To UBound(MyData)
        ListBox1.AddItem (MyData(i).WordText & "=" & Str(MyData(i).WordCount))
    Next i
End Sub
Community
  • 1
  • 1
Alex
  • 113
  • 6
  • 2
    I will take a look, but a quick fix you could exit the IsValidWord once its been set to false, as youre checking space to 9 each time. – Nathan_Sav Nov 10 '15 at 19:28
  • Thanks, I should have implemented that.. – Alex Nov 10 '15 at 19:38
  • 3
    Loop over all of the words in the document, adding them to a Scripting Dictionary: each time you find a new word, increment the value for that "key". You can then check your list against the Dictionary to find the count for each word. – Tim Williams Nov 10 '15 at 19:44

1 Answers1

4

Add a reference to the Microsoft Scripting Runtime (Tools -> References...). Then use the following:

Private Sub CommandButton1_Click()
    Const SpanishLCID = 3082
    Dim dict As New Scripting.Dictionary, word As Variant, fixedWord As String
    Dim key As Variant

    dict.CompareMode = SpanishLCID
    For Each word In ActiveDocument.Words
        fixedWord = Trim(StrConv(Trim(word), vbLowerCase, SpanishLCID))
        If Not dict.Exists(fixedWord) Then
            dict(fixedWord) = 1
        Else
            dict(fixedWord) = dict(fixedWord) + 1
        End If
    Next

    ListBox1.Clear
    For Each key In dict.Keys
        ListBox1.AddItem key & "=" & dict(key)
    Next
End Sub

NB. Word treats each punctuation symbol or paragraph as a new word. It may be advisable to specify another Dictionary or Collection with the strings that shouldn't be added to the dictionary, and test for those strings using .Exists before adding to the dictionary.


A more concise version of IsValidWord without regular expressions:

Function IsValidWord(s As String) As Boolean
    Const validChars As String = "abcdefghijklmnopqrstuvwxyz"
    Dim i As Integer, char As String * 1
    For i = 1 To Len(s)
        char = Mid(s, i, 1)
        If InStr(1, validChars, char, vbTextCompare) = 0 Then Exit Function
    Next
    IsValidWord = True
End Function

and using regular expressions (add a reference to Microsoft VBScript Regular Expressions 5.5):

Dim regex As RegExp
Function IsValidWord2(s As String) As Boolean
    If regex Is Nothing Then
        Set regex = New RegExp
        regex.Pattern = "[^a-z]"
        regex.IgnoreCase = True
    End If
    IsValidWord2 = Not regex.Test(s)
End Function

and using regular expressions with replacement:

Function GetValidWord(s As String) As String
    'GetValidWord("Introduction.......3") will return "Introduction"
    If regex2 Is Nothing Then
        Set regex2 = New RegExp
        regex2.Pattern = "[^a-z]"
        regex2.Global = True
        regex2.IgnoreCase = True
    End If
    GetValidWord = regex2.Replace(s, "")
End Function

and you would use it as follows:

    For Each word In ActiveDocument.Words
        fixedWord = Trim(StrConv(Trim(word), vbLowerCase, SpanishLCID))
        fixedWord = GetValidWord(fixedWord)
        If Not dict.Exists(fixedWord) Then

NB: You might combine the language conversion and Trim into GetValidWord.

Community
  • 1
  • 1
Zev Spitz
  • 13,950
  • 6
  • 64
  • 136
  • Thank you! It does the job very fast! – Alex Nov 11 '15 at 17:53
  • However I've added the IsValidWord() function to avoid wrong words to beadded and counted (do not know how to predict all possible versions of wrong words to add into a special dictionary to filter them out, e.g. "Introduction.....................................3", or some Unicode strings that looks like just a whitespace) – Alex Nov 11 '15 at 17:59
  • You might want to have a look at [Regular Expressions in VBA](http://stackoverflow.com/a/22542835/4600127) to specify valid words? – Verzweifler Dec 18 '15 at 07:19