1

I have a document with labeled (and some unlabeled!) paragraphs:
1.0 ...
...
2.4.3 ...
...
6.18.21.8 ...
Etc.

I need to find all those labels, and only those labels (regardless of what the paragraph content is and what other text may be present, e.g. unlabeled paragraphs/text). The expected document format is this:

  • New paragraph character, followed by
  • One or more number characters, followed by
  • A period, followed by
  • Some number of iterations of the preceding two steps, in order (number characters and a period), followed by
  • One or more number characters, followed by
  • Two spaces

Right now I have this expression, which may be close but isn't right because Word interprets the expression inside the first set of parentheses as me wanting to repeat the match rather than the pattern. (I need the latter.)

^13([0-9]@[\.])@[0-9]@(  )

Any tips on writing a regular expression that will yield the correct results, as described above?

JJEII

2 Answers

0

This matches the last five steps of your pattern. I'm not really sure what you mean by the new paragraph character, but if it is always the same character, just put it at the beginning of the expression.

([0-9]+\.)+[0-9]+(  )
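
Whether the period is escaped matters in conventional regex syntax: an unescaped . matches any character, while \. matches only a literal dot. The quick check below is a minimal sketch that uses Python purely as a test bench outside Word; the sample strings are made up for illustration.

```python
import re

# Unescaped period: "." matches any character, so "2a4  " is wrongly accepted.
loose = re.compile(r"([0-9]+.)+[0-9]+(  )")
# Escaped period: only a literal "." may separate the number groups.
strict = re.compile(r"([0-9]+\.)+[0-9]+(  )")

# Hypothetical sample strings: one real label and one look-alike.
samples = ["2.4.3  text", "2a4  text"]

loose_hits = [s for s in samples if loose.match(s)]
strict_hits = [s for s in samples if strict.match(s)]

print(loose_hits)   # both strings
print(strict_hits)  # only "2.4.3  text"
```

This only says something about conventional regex engines; Word's own wildcard search uses a different syntax.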

If you are open to using VBA, here is a sub that replaces the matches with whatever you assign to the replace variable. Note that you will need to activate the Regex library, which you can learn how to do here (it's for Excel but works the same in Word). Then add a module and paste the text below. I think the new paragraph character is either \n or \t, but I'm not 100% sure about that.

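Sub remove()
    ' Requires a reference to "Microsoft VBScript Regular Expressions 5.5"
    ' (Tools > References in the VBA editor).
    Dim reg As New RegExp
    Dim pattern As String
    Dim replace As String

    replace = ""
    ' The period is escaped so that it matches only a literal dot.
    pattern = "([0-9]+\.)+[0-9]+(  )"
    With reg
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .pattern = pattern
    End With

    ' Assigning Range.Text rewrites the whole document body, so formatting may be lost.
    If reg.Test(ActiveDocument.Range.Text) Then
        ActiveDocument.Range.Text = reg.replace(ActiveDocument.Range.Text, replace)
    End If
End Sub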
Latch
  • Thanks, Latch. I think you basically understand what I'm trying to accomplish. Unfortunately, that's not a Word regex, and the period is a special regex character, too, so needs to be escaped. (E.g. Word uses @ instead of +.) The ^13 is the paragraph character code Word recognizes for this, so I know that's right. And, Word still interprets the parentheses as a match instead of pattern, anyway. Bleh. – JJEII Jul 09 '15 at 13:43
  • @JJEII My bad, for some reason, I assumed you were using VBA. I edited my answer, hopefully that will help. – Latch Jul 09 '15 at 14:24
0

Word doesn't seem to comply with its own regex documentation. To some degree, the Special drop-down in the Search and Replace box can help. In my case, it inserts {;} instead of the documented {,} for the number of repetitions. (Once you know about the semicolon instead of the comma, you can of course type it yourself. On the other hand, this seems to differ even between versions of Word.) Speaking of repetitions, Word has significant trouble handling them.

You might want to verify this by searching your example plus a small addition

1.0  ...
...
2.4.3  ...
...
6.18.21.8  ...
...
...1.0  ...

with ^13([0-9]@.)@[0-9]@. This should match the three number-and-dot sequences at the start of their lines, but not the fourth, where the line starts with other characters. In my version of Word, however, it matches only the very first one. That is consistent with ^13([0-9]{1;}.){1;}[0-9]{1;} matching only the first one and ^13([0-9]{1;}.){2;}[0-9]{1;} matching nothing at all. (This also mirrors your observation that Word repeats the exact matched sequence rather than the pattern.)

You might want to check the transcription in RegEx 101 as a proof of concept.
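
As a rough equivalent of that proof of concept, here is a minimal Python sketch of the intended match in conventional regex syntax. It assumes paragraphs are separated by a carriage return (as the ^13 code suggests) and that the first paragraph, which has no preceding paragraph mark, should also count. The names LABEL and sample are illustrative only.

```python
import re

# Start of text or a paragraph mark (\r), then one or more "digits + period"
# groups, then digits, then two spaces. Only the label itself is captured.
LABEL = re.compile(r"(?:\A|\r)((?:[0-9]+\.)+[0-9]+)  ")

# The sample from above, one paragraph per entry, joined with carriage returns.
sample = "\r".join([
    "1.0  ...",
    "...",
    "2.4.3  ...",
    "...",
    "6.18.21.8  ...",
    "...",
    "...1.0  ...",  # label-like text that does not start the paragraph
])

labels = LABEL.findall(sample)
print(labels)  # ['1.0', '2.4.3', '6.18.21.8']
```

The last paragraph is skipped because its digits do not follow a paragraph mark directly, which is the behaviour the Word expression should show but does not.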

The closest possible to your requirements is probably either:

  • ^13[0-9.]{1;} (with the tightened ^13[0-9.]{1;}.[0-9]{1;} again not working at all), which unfortunately also accepts patterns you actually want excluded, or
  • running ^13[0-9]{1;}.[0-9]{1;}, ^13[0-9]{1;}.[0-9]{1;}.[0-9]{1;}, ^13[0-9]{1;}.[0-9]{1;}.[0-9]{1;}.[0-9]{1;}, etc. one after another, which lacks much of the beauty and flexibility of a regex but is much stricter about what it accepts.
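
Typing the fixed-depth expressions from the second option by hand gets tedious, so they could be generated instead. The following is a small sketch, not a tested Word workflow: word_patterns, sep and trailing are hypothetical names. sep covers the {;} versus {,} difference between Word versions, and trailing restores the two spaces that were lost from the expressions shown here.

```python
def word_patterns(max_depth, sep=";", trailing="  "):
    """Build one Word wildcard search string per label depth.

    Depth 1 gives N.N, depth 2 gives N.N.N, and so on.
    """
    digits = "[0-9]{1" + sep + "}"  # "one or more digits" in Word's wildcard syntax
    return [
        "^13" + ".".join([digits] * (depth + 1)) + trailing
        for depth in range(1, max_depth + 1)
    ]


# Print the search strings for labels with one to three periods.
for pattern in word_patterns(3):
    print(repr(pattern))
```

Each generated string would still have to be pasted into the Search and Replace box and run separately.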

Depending on your overall requirements, you might be better off using a different tool for that particular job.

BTW:

  • Word uses ? instead of . to denote any character. This is why the dot does not need to be escaped in the above expressions.
  • Word should actually accept dot or backslash for [\.] - but requires [\\.] instead (in my version).
  • "Some number of iterations of the preceding two steps" is read (following your sample code) as meaning at least once.
  • The trailing blanks in the above regex are lost due to the handling of blanks in HTML.
  • If you are using Word's heading functionality (in particular the respective heading styles): have you tried the Outline view (perhaps with the body text hidden) for this purpose?
Abecee