I'm using regular expressions to search against the plain text returned by the following property:

namespace Microsoft.Office.Interop.Word
{
    public class Range
    {
        ...
        public string Text { get; set; }
        ...
    }
}

Based on the matches, I want to change the formatted text that corresponds to the plain text. The problem is that the character indices in the .Text property do not line up with the .Start and .End properties of the Range object. Does anyone know a way to match these indices up?

(I can't use Word's wildcard find as a replacement for .NET regular expressions, because it isn't powerful enough for the patterns I'm searching for: non-greedy operators, etc.)

I can move the correct number of characters by starting with Document.Range(), collapsing it with Collapse(WdCollapseDirection.wdCollapseStart), and then calling range.MoveStart(WdUnits.wdCharacter, match.Index), since moving by characters lines the formatted text position up with the matches in the plain text.

My problem now is that I'm always 4 characters too far along in the formatted text, so maybe it has something to do with the other story ranges? I'm not sure.

Martijn Pieters
Carl G
  • How are you pulling the range into RegEx? (i.e. characters, sentences, paragraphs, etc.) – Todd Main Sep 22 '10 at 20:32
  • I perform a RegEx.Match on Document.Range().Text to find all matches in the document. See my update for possible solution (still not quite there though.) – Carl G Sep 22 '10 at 22:40
  • Great that you found a solution! You can post it as an answer and then accept it as the correct answer. – Todd Main Sep 24 '10 at 18:09
  • If you need more information on this topic you may want to also consider: http://stackoverflow.com/questions/29552095/word-vba-iterating-through-characters-incredibly-slow – CRC Apr 17 '15 at 17:01

1 Answer

Apparently the reason my matches were still off had to do with hidden "Bell" characters (char bell = '\a';). By replacing these with the empty string inside Application.ActiveDocument.Range().Text, my matches on this property now match up correctly with the range achieved by:

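// Collapse the whole-document range to its start, then advance by the match index.
Word.Range range = activeDocument.Range();
range.Collapse(Word.WdCollapseDirection.wdCollapseStart);
range.MoveStart(Word.WdUnits.wdCharacter, regexMatch.Index);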

Basically you can mirror indexes in the .Text property by moving through the formatted text character-by-character. The only caveat is that you need to remove the strange characters such as the bell character from the .Text property.
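The text-cleaning half of this can be sketched without touching Word at all. The helper below is only an illustration: the class name RangeTextMatcher and its two methods are hypothetical names, not part of the Word interop API, and the sketch assumes the bell character ('\a') is the only hidden character that needs removing, which may not hold for documents with more advanced elements.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of Microsoft.Office.Interop.Word.
public static class RangeTextMatcher
{
    // Removes the hidden bell characters ('\a', U+0007) from the Range.Text value.
    public static string StripBellCharacters(string rangeText)
    {
        return rangeText.Replace("\a", string.Empty);
    }

    // Runs the pattern against the cleaned text and returns one
    // (index, length) pair per match, in document order.
    public static List<KeyValuePair<int, int>> FindMatchOffsets(string rangeText, string pattern)
    {
        string cleaned = StripBellCharacters(rangeText);
        var offsets = new List<KeyValuePair<int, int>>();
        foreach (Match match in Regex.Matches(cleaned, pattern))
        {
            offsets.Add(new KeyValuePair<int, int>(match.Index, match.Length));
        }
        return offsets;
    }
}
```

Each pair's Key would then be the count passed to MoveStart in the snippet above. Extending the range by the pair's Value (for example with MoveEnd) so that it covers the whole match is an assumption about the next step, not something tested here.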

Martijn Pieters
Carl G
  • Question is old, but today I meet the same issue with Word document containing advanced elements as sections, table of contents, ...etc. Solution works only with document containing basic elements like formatted text, table, links, lists... Did you meet same issue? – Koryonik Oct 13 '14 at 20:02
  • @Koryonik, yes, other elements can cause problems. I just discovered that in Word 2013, content controls take up space in the Range but aren't in the Range.Text. This is the opposite of Carl G's problem. Rather than deleting characters from Range.Text, you have to insert a blank at the beginning and end of each content control to get things to line up. – cxw Apr 01 '15 at 14:40