Word VBA: iterating through characters incredibly slow

Question

I have a macro that changes single quotes in front of a number to an apostrophe (or close single curly quote). Typically when you type something like "the '80s" in word, the apostrophe in front of the "8" faces the wrong way. The macro below works, but it is incredibly slow (like 10 seconds per page). In a regular language (even an interpreted one), this would be a fast procedure. Any insights why it takes so long in VBA on Word 2007? Or if someone has some find+replace skills that can do this without iterating, please let me know.

Sub FixNumericalReverseQuotes()
    Dim char As Range
    Debug.Print "starting " + CStr(Now)
    With Selection
        total = .Characters.Count
        ' Will be looking ahead one character, so we need at least 2 in the selection
        If total < 2 Then
            Return
        End If
        For x = 1 To total - 1
            a_code = Asc(.Characters(x))
            b_code = Asc(.Characters(x + 1))

            ' We want to convert a single quote in front of a number to an apostrophe
            ' Trying to use all numerical comparisons to speed this up
            If (a_code = 145 Or a_code = 39) And b_code >= 48 And b_code <= 57 Then
                .Characters(x) = Chr(146)
            End If 
        Next x
    End With
    Debug.Print "ending " + CStr(Now)
End Sub

iumo · Accepted Answer · 2015-04-27T07:34:15.697

5

Beside two specified (Why...? and How to do without...?) there is an implied question – how to do proper iteration through Word object collection. Answer is – to use obj.Next property rather than access by index. That is, instead of:

For i = 1 to ActiveDocument.Characters.Count
    'Do something with ActiveDocument.Characters(i), e.g.:
    Debug.Pring ActiveDocument.Characters(i).Text
Next

one should use:

Dim ch as Range: Set ch = ActiveDocument.Characters(1)
Do
    'Do something with ch, e.g.:
    Debug.Print ch.Text
    Set ch = ch.Next 'Note iterating
Loop Until ch is Nothing

Timing: 00:03:30 vs. 00:00:06, more than 3 minutes vs. 6 seconds.

Found on Google, link lost, sorry. Confirmed by personal exploration.

edited Apr 27 '15 at 07:34

answered Apr 25 '15 at 13:00

iumo

66
2

Yes! This is the right answer. Implementing my original macro using the Next iterator works perfectly on large documents and interacts well with track changes. It runs in a reasonable amount of time (about 3-4 seconds on 10,000 words). – Justin L. Apr 28 '15 at 21:09
@iumo What if it was a Table Cell Range ? How would you loop from start to end just in that specific range ? – pelican_george Oct 07 '15 at 18:52
isn't this the same as `For Each ch In ActiveDocument.Characters ... Next ch` ? – M D P Sep 09 '18 at 14:38

Comintern · Answer 2 · 2015-04-19T20:14:31.987

This is a problem begging for regular expressions. Resolving the .Characters calls that many times is probably what is killing you in performance.

I'd do something like this:

Public Sub FixNumericalReverseQuotesFast()

    Dim expression As RegExp
    Set expression = New RegExp

    Dim buffer As String
    buffer = Selection.Range.Text

    expression.Global = True
    expression.MultiLine = True
    expression.Pattern = "[" & Chr$(145) & Chr$(39) & "]\d"

    Dim matches As MatchCollection
    Set matches = expression.Execute(buffer)

    Dim found As Match
    For Each found In matches
        buffer = Replace(buffer, found, Chr$(146) & Right$(found, 1))
    Next

    Selection.Range.Text = buffer

End Sub

NOTE: Requires a reference to Microsoft VBScript Regular Expressions 5.5 (or late binding).

EDIT: The solution without using the Regular Expressions library is still avoiding working with Ranges. This can easily be converted to working with a byte array instead:

Sub FixNumericalReverseQuotes()
    Dim chars() As Byte
    chars = StrConv(Selection.Text, vbFromUnicode)

    Dim pos As Long
    For pos = 0 To UBound(chars) - 1
        If (chars(pos) = 145 Or chars(pos) = 39) _
        And (chars(pos + 1) >= 48 And chars(pos + 1) <= 57) Then
           chars(pos) = 146
        End If
    Next pos

    Selection.Text = StrConv(chars, vbUnicode)
End Sub

Benchmarks (100 iterations, 3 pages of text with 100 "hits" per page):

Regex method: 1.4375 seconds
Array method: 2.765625 seconds
OP method: (Ended task after 23 minutes)

About half as fast as the Regex, but still roughly 10ms per page.

EDIT 2: Apparently the methods above are not format safe, so method 3:

Sub FixNumericalReverseQuotesVThree()

    Dim full_text As Range
    Dim cached As Long

    Set full_text = ActiveDocument.Range
    full_text.Find.ClearFormatting
    full_text.Find.MatchWildcards = True
    cached = full_text.End

    Do While full_text.Find.Execute("[" & Chr$(145) & Chr$(39) & "][0-9]")
        full_text.End = full_text.Start + 2
        full_text.Characters(1) = Chr$(96)
        full_text.Start = full_text.Start + 1
        full_text.End = cached
    Loop

End Sub

Again, slower than both the above methods, but still runs reasonably fast (on the order of ms).

I would love to use regular expressions and that was my first instinct too, but I don't seem to have a regular expression library available to me in my environment. This post suggests that I should be able to turn it on, but I don't see that in my mac environment. http://stackoverflow.com/questions/25102372/how-to-use-enable-regexp-object-regular-expression-using-vba-macro-in-word — Justin L., Apr 13 '15 at 23:42
@JustinL. - Fair enough. You weren't lying about it being slow - edit coming as soon as my benchmark finishes... — Comintern, Apr 14 '15 at 00:33
Amazing that its faster to do it that way. I wonder what is going on inside of Characters(). The "Array method" works for me, except that I made one change: instead of replacing the entire block, I make the replacement directly in the Selection.Characters. Since there are only a handful of edits in a document this is still fast. The advantage is that you don't lose formatting and track changes is still sensible. — Justin L., Apr 15 '15 at 01:57
@JustinL. - Anything in Word (and Excel for that matter) that evaluates to a Range carries a *ton* of overhead. Most of it is keeping track of formatting, the relative position in the document, etc. A good way to conceptualize it is to look at all the properties a Range has - that's what you create when you call .Characters(x). My philosophy is to avoid them unless I need to use a specific property or method. — Comintern, Apr 15 '15 at 02:04
@JustinL. - Completely missed the formatting problems you were running into. See Edit 2. — Comintern, Apr 19 '15 at 20:15

score 0 · Answer 3 · answered Apr 15 '15 at 02:01

Modified version of @Comintern's "Array method":

Sub FixNumericalReverseQuotes()
    Dim chars() As Byte
    chars = StrConv(Selection.Text, vbFromUnicode)

    Dim pos As Long
    For pos = 0 To UBound(chars) - 1
        If (chars(pos) = 145 Or chars(pos) = 39) _
        And (chars(pos + 1) >= 48 And chars(pos + 1) <= 57) Then
           ' Make the change directly in the selection so track changes is sensible.
           ' I have to use 213 instead of 146 for reasons I don't understand--
           ' probably has to do with encoding on Mac, but anyway, this shows the change.
           Selection.Characters(pos + 1) = Chr(213)
        End If
    Next pos
End Sub

Unfortunately my answer doesn't appear very robust. On larger documents the index from the byte array eventually stops aligning with the character index. It works okay with smaller paragraphs. The Regexp version is probably the best answer even though I can't use it in my environment. — Justin L., Apr 15 '15 at 02:19

Dan Donoghue · Answer 4 · 2015-04-15T22:33:18.380

Maybe this?

Sub FixNumQuotes()
    Dim MyArr As Variant, MyString As String, X As Long, Z As Long
    Debug.Print "starting " + CStr(Now)
    For Z = 145 To 146
        MyArr = Split(Selection.Text, Chr(Z))
        For X = LBound(MyArr) To UBound(MyArr)
            If IsNumeric(Left(MyArr(X), 1)) Then MyArr(X) = "'" & MyArr(X)
        Next
        MyString = Join(MyArr, Chr(Z))
        Selection.Text = MyString
    Next
    Selection.Text = Replace(Replace(Selection.Text, Chr(146) & "'", "'"), Chr(145) & "'", "'")
    Debug.Print "ending " + CStr(Now)
End Sub

I am not 100% sure on your criteria, I have made both an open and close single quote a ' but you can change that quite easily if you want.

It splits the string to an array on chr(145), checks the first char of each element for a numeric and prefixes it with a single quote if found.

Then it joins the array back to a string on chr(145) then repeats the whole things for chr(146). Finally it looks through the string for an occurence of a single quote AND either of those curled quotes next to each other (because that has to be something we just created) and replaces them with just the single quote we want. This leaves any occurence not next to a number intact.

This final replacement part is the bit you would change if you want something other than ' as the character.

I think the final assignment of the Replace string into Selection.Text is going to have the same problem: it doesn't really work with track changes and it will nuke all formatting in the document, right? — Justin L., Apr 17 '15 at 17:33
Try is and see, without having your actual data to test on it was a little bit of guess work — Dan Donoghue, Apr 17 '15 at 21:03

score 0 · Answer 5 · answered Apr 17 '15 at 16:05

I have been struggling with this for days now. My attempted solution was to use a regular expression on document.text. Then, using the matches in a document.range(start,end), replace the text. This preserves formatting.

The problem is that the start and end in the range do not match the index into text. I think I have found the discrepancy - hidden in the range are field codes (in my case they were hyperlinks). In addition, document.text has a bunch of BEL codes that are easy to strip out. If you loop through a range using the character method, append the characters to a string and print it you will see the field codes that don't show up if you use the .text method.

Amazingly you can get the field codes in document.text if you turn on "show field codes" in one of a number of ways. Unfortunately, that version is not exactly the same as what the range/characters shows - the document.text has just the field code, the range/characters has the field code and the field value. Therefore you can never get the character indices to match.

I have a working version where instead of using range(start,end), I do something like:

Set matchRange = doc.Range.Characters(myMatches(j).FirstIndex + 1)           
matchRange.Collapse (wdCollapseStart)
Call matchRange.MoveEnd(WdUnits.wdCharacter, myMatches(j).Length)
matchRange.text = Replacement

As I say, this works but the first statement is dreadfully slow - it appears that Word is iterating through all of the characters to get to the correct point. In doing so, it doesn't seem to count the field codes, so we get to the correct point.

Bottom line, I have not been able to come up with a good way to match the indexing of the document.text string to an equivalent range(start,end) that is not a performance disaster.

Ideas welcome, and thanks.

Word VBA: iterating through characters incredibly slow

5 Answers5

Linked