I need to analyze the text of my Word document and create bookmarks on the ranges of text that my analyzer has detected (much like a grammar checker).

I don't want to use the Find() utility, because my needs are too specific.


Explanations

Here is what I do:

1/ Retrieve the document's plain text

I retrieve the plain text of the main story of my document:

String plainText = ActiveDocument.Range().Text;

2/ Analyze the plain text and get results

I send it to my analyzer tool, which returns a collection of markers with positions. For example, if I wanted to detect the pattern "my pattern" in the document text, the analyzer could return a marker such as { pattern : "my marker", start: 5, end : 14 }, where "start" and "end" are the character indexes of the pattern in the plain text that was sent.

3/ Display the results in the document

I create bookmarks from these markers. For the previous example, it would be:

// Init a new range and collapse it to the start of the document
Word.Range range = activeDocument.Range();
range.Collapse(Word.WdCollapseDirection.wdCollapseStart);

// Move character by character in the "formatted" text
range.MoveStart(Word.WdUnits.wdCharacter, marker.Start); // marker.Start = 5

// Set the length (end)
range.SetRange(range.Start, range.Start + (marker.End - marker.Start)); // marker.End = 14

4/ Results

4.1 Global Result

Everything is OK when the document's main story contains text, links, lists and titles: ranges are positioned correctly, and plain text indexes match formatted text indexes.

4.2 Issue with tables

When a document contains a table, ranges are off by a few characters: plain text indexes no longer match formatted text indexes exactly.

I found the cause of this issue (it was explained on other forums): it is due to the non-printing character char(7), a cell delimiter that is added to the plain text. If we account for these characters when calculating range positions, everything is OK!
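The correction described above can be sketched as a small index conversion. This is only a sketch: OffsetMapper and ToRangeOffset are hypothetical names, and it assumes that each char(7) in the plain text has no position of its own in the document range, which matches what I observed with tables but should be checked against your own documents.

```csharp
using System;

static class OffsetMapper
{
    // Hypothetical helper: converts an index in the plain text (Range.Text)
    // into a range position. Assumption: every char(7) cell delimiter before
    // the index is extra in the plain text and must be subtracted.
    public static int ToRangeOffset(string plainText, int plainIndex)
    {
        if (plainIndex < 0 || plainIndex > plainText.Length)
            throw new ArgumentOutOfRangeException(nameof(plainIndex));

        int delimiters = 0;
        for (int i = 0; i < plainIndex; i++)
        {
            if (plainText[i] == '\a') delimiters++; // '\a' is char(7)
        }
        return plainIndex - delimiters;
    }
}
```

Both marker.Start and marker.End would go through this conversion before the range is built, so the length is corrected as well when a match spans a cell boundary.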

4.3 Issue with content controls, tables of contents, sections and others

When a document contains these elements, ranges are also off by a few characters. Other non-printing characters appear in the plain text, but I don't understand what they mean or how to account for them when calculating range positions.

When I display the Word element markers with "Developer ribbon > Design Mode", I see 2 markers per element: shifting the plain text indexes by 2 * (number of elements) resolves the issue. It seems OK.

4.4 Issue with Endpaper

I don't know the English term for "page de garde" (French); I will call it "endpaper", although it is probably "cover page". It is the first page, with its own header, footer and content controls :)

When a document contains an endpaper, ranges are also off by a few characters. But this time, there are no non-printing markers in the plain text.

One more detail: when I display the Word element markers with "Developer ribbon > Design Mode", I do see endpaper markers.


Questions

  • How can I detect an endpaper in a Word document range?
  • Why don't plain text indexes always match formatted text indexes, depending on which elements the Word document contains?
  • Would manipulating the XML nodes be a more reliable alternative? If yes, could you give me good examples of managing bookmarks or other elements in the current document with the XML API?

Other resources

I found similar issues :

I hope my explanations are clear and that you can help me understand what is wrong, or show me a better way to do this.

Thanks, really.

Martijn Pieters
Koryonik

1 Answer

It's not really pretty, but you can try to remove the unwanted characters with a Regex. For example, to remove the \a characters (character code 7):

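// Requires: using System.Text.RegularExpressions;
// Build a character class containing the cell delimiter (char 7, i.e. '\a')
string j = new string(new char[] { (char)7 });
plainText = Regex.Replace(plainText, string.Format("[{0}]", j), "");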

Now you have to identify the other 'evil' characters and add them to the char array. If it works, you will get a string whose length corresponds to the number of Characters in your document. You will probably have to adapt this code by experimenting. (I was not sure which language you are using; I assumed C#.)
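To help with identifying those characters, here is a minimal sketch that counts every control character in the plain text by its numeric code. ControlCharReport and Count are hypothetical names of mine, and the sketch only reports what is there; which codes actually have to be removed is something you still have to find out by experiment.

```csharp
using System;
using System.Collections.Generic;

static class ControlCharReport
{
    // Hypothetical helper: counts each control character found in the text,
    // keyed by its numeric code, so that unexpected ones can be inspected.
    public static SortedDictionary<int, int> Count(string text)
    {
        var counts = new SortedDictionary<int, int>();
        foreach (char c in text)
        {
            if (!char.IsControl(c)) continue;
            int existing;
            counts.TryGetValue(c, out existing); // existing stays 0 if the code is new
            counts[c] = existing + 1;
        }
        return counts;
    }
}
```

Running it on the document text and printing each key/value pair shows, for example, how many char(7) cell delimiters and char(13) paragraph marks the text contains; paragraph marks are expected, so only the other codes are candidates for the char array.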

Update: another idea (if it is applicable to your analyzer tool):

Break your problem down into single paragraphs:

foreach (Word.Paragraph pg in activeDocument.Paragraphs)
{
    // Range is a property of Paragraph, not a method
    Word.Range range = pg.Range;
    string text = range.Text;
    // your stuff here
}

With these paragraph range objects and the text strings they contain, you do the same thing you tried to do with the whole document object and its text, just paragraph by paragraph. All these paragraphs are 'addressable' by ranges and Move operations, as you already do. I suppose the problematic characters are outside or at the end of the paragraphs, so they don't influence the character counting inside these paragraphs.

As I can't reproduce what you call endpaper, I can't validate this. Besides, I don't know whether special text ranges such as page headers and tables of contents are covered by paragraphs. But at least you can reduce your problem to smaller ranges. I think it is worth trying.

Fratyx
  • Thanks for your response. I tried to resolve the issue with this solution, without success: range.Text.Replace("\a", " ").Replace("\u000B", "").Replace("\u000C", ""); – Koryonik Oct 15 '14 at 13:44
  • Strange... I got rid of the `\a`s in this way. Doesn't the `.Replace("\a", "")` at least solve the table problem? (And could it be that there is a space in your replacement for \a?) – Fratyx Oct 15 '14 at 13:54
  • Yes, it works for tables. I'm already using this hack for the table cell issue. But if I add an endpaper (a first page with content controls), each cell is shifted by one char (the removed \a char, I think), but only with a first page. Very strange :( – Koryonik Oct 15 '14 at 14:58