1

I want to match all phone numbers that are wrapped between << and >> tags.
This is my regex for phone numbers:

0[2349]{1}\-[1-9]{1}[0-9]{6}

I tried to add a lookahead (and a lookbehind) like (?=(?:>>)), but that didn't work for me.

DEMO

shivesh
  • Must this all be done with a single RegExp? Couldn't you just get all the matches for `<<(.*?)>>` then search within them using your current RegExp? – gnarf Jun 18 '10 at 11:45
  • @shivesh: how many wrapping patterns can there be in the input? Only one `<<...>>` section, or can there be multiple? If multiple, can they be nested? – polygenelubricants Jun 18 '10 at 12:02
  • Yes I prefer to have one regex. – shivesh Jun 18 '10 at 12:03
  • @shivesh: you didn't answer the question. How many `<<...>>` sections can there be in the input? Can there be multiple? If so, can they be nested? – polygenelubricants Jun 18 '10 at 12:17
  • @polygenelubricants: there can be any number of nested <<>>. – shivesh Jun 18 '10 at 12:22
  • @shivesh: if you can nest the `<<..>>` then you can't do this with regex. Or perhaps you can in .NET, but essentially there's a parenthesis-balancing subproblem, which isn't a regular language. – polygenelubricants Jun 18 '10 at 12:37
  • @shivesh: "I prefer to have one regex"..? I totally agree with gnarf. Using two regular expressions makes your solution trivial, simple and easy to maintain. Why on earth try to squeeze it down to one hellish regular expression? – simendsjo Jun 18 '10 at 12:39

5 Answers

4

The following seems to work (as seen on ideone.com):

Regex r = new Regex(@"(?s)<<(?:(?!>>)(?:(0[2349]\-[1-9][0-9]{6})|.))*>>");

Each <<...>> section is a Match, and all phone numbers in that section will be captured in Groups[1].Captures.
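
To make the Captures behaviour concrete, here is a minimal C# sketch that runs the pattern above and collects every capture of group 1. The class name `WrappedPhoneDemo`, the method `FindWrapped`, and the sample text are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WrappedPhoneDemo
{
    // Pattern from the answer above, unchanged.
    private static readonly Regex Sections = new Regex(
        @"(?s)<<(?:(?!>>)(?:(0[2349]\-[1-9][0-9]{6})|.))*>>");

    // Returns every phone number found inside a <<...>> section.
    public static List<string> FindWrapped(string input)
    {
        var found = new List<string>();
        foreach (Match section in Sections.Matches(input))
        {
            // Group 1 sits inside a repeated group; .NET keeps each capture it made.
            foreach (Capture number in section.Groups[1].Captures)
            {
                found.Add(number.Value);
            }
        }
        return found;
    }

    public static void Main()
    {
        // Hypothetical sample input, not the question's demo text.
        string text = "02-1234567 <<03-7654321 and 04-1111111>> <<none>> 09-2222222";
        Console.WriteLine(string.Join(", ", FindWrapped(text)));
    }
}
```

With this sample, only the two numbers inside the first section should be collected; the numbers outside any section and the empty `<<none>>` section contribute nothing.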


How the pattern is constructed

First of all, I simplified your phone number pattern to:

0[2349]\-[1-9][0-9]{6}

That is, each {1} is superfluous, so I removed them (see "Using explicitly numbered repetition instead of question mark, star and plus").

Then, let's try to match each <<...>> section. Let's start at:

(?s)<<((?!>>).)*>>

This will match each <<..>> section. Each . that consumes the body is guarded by the negative lookahead (?!>>), so the match cannot run past the closing >>.
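
As a rough illustration of this intermediate step, the sketch below applies only the section pattern and returns each matched section. `SectionDemo`, `FindSections`, and the sample strings are hypothetical names and data chosen for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SectionDemo
{
    // Returns the full text of every <<...>> section, delimiters included.
    public static List<string> FindSections(string input)
    {
        var sections = new List<string>();
        // (?s) lets . match newlines; (?!>>) stops the body before the closing >>.
        foreach (Match m in Regex.Matches(input, @"(?s)<<((?!>>).)*>>"))
        {
            sections.Add(m.Value);
        }
        return sections;
    }

    public static void Main()
    {
        // Hypothetical input with one single-line and one multi-line section.
        foreach (string s in FindSections("a <<one>> b <<two\nlines>> c"))
        {
            Console.WriteLine("[" + s + "]");
        }
    }
}
```

Note that the guard stops at the first >>, so for nested input such as `<<a <<b>> c>>` this sketch matches `<<a <<b>>` rather than a balanced outer section.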

Then, instead of matching ., we give priority to matching your phone number instead. That is, we replace . with

(phonenumber|.)

Then I simply made the other groups non-capturing, so the phone number is captured by group 1, and that's pretty much it. Because .NET regex stores every capture a group makes within a single match, the repeated group collects all the phone numbers in a section.

polygenelubricants
0
<<0[2349]{1}\-[1-9]{1}[0-9]{6}>>
sra
MrFox
0

I asked a similar question some time ago, using brackets ([]) instead of <<>>:

Link here

This should really help. Cheers

Edit: It should support your demo no problem.

João Pereira
0

This can easily be done with two regex patterns:

To identify the section:

<<.*>>

Then run the second regex on each match from the first:

0[2349]-[1-9]\d{6}

Remember to enable the option that makes the dot match newlines. I know it isn't exactly what you were asking for, but it will work.
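
A minimal C# sketch of the two-pass idea might look like the following. It assumes the sections are not nested, and it swaps in the lazy `<<.*?>>` form from gnarf's comment, because a greedy `.*` with dot-matches-newline would run from the first << to the last >> when the input has several sections. `TwoPassDemo`, `FindWrapped`, and the sample text are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TwoPassDemo
{
    // Pass 1: one match per <<...>> section (lazy, dot also matches newlines).
    private static readonly Regex Section =
        new Regex(@"<<.*?>>", RegexOptions.Singleline);

    // Pass 2: the phone number pattern from this answer.
    private static readonly Regex Phone = new Regex(@"0[2349]-[1-9]\d{6}");

    public static List<string> FindWrapped(string input)
    {
        var found = new List<string>();
        foreach (Match section in Section.Matches(input))
        {
            // Search only inside the text of the current section.
            foreach (Match phone in Phone.Matches(section.Value))
            {
                found.Add(phone.Value);
            }
        }
        return found;
    }

    public static void Main()
    {
        // Hypothetical sample input.
        string text = "<<03-7654321>> 02-1234567 <<04-1111111>>";
        Console.WriteLine(string.Join(", ", FindWrapped(text)));
    }
}
```

Each pattern stays short enough to read on its own, which is the main argument for this approach.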

Jesper Fyhr Knudsen
  • Is it possible to do it with just one regex? – shivesh Jun 18 '10 at 12:02
  • @shivesh, as the accepted answer shows, it is possible to do in one regex, but it also shows the biggest problem with regexes: they very easily become unreadable and hard to maintain. Unless it is strictly necessary, I usually split it up into smaller, easier-to-understand patterns. – Jesper Fyhr Knudsen Jun 18 '10 at 14:31
0

I think gnarf's (and Arkain's) suggestion is very sensible – you don't have to use one regex to do all the work.

But if you really want to use one hard-to-read, non-portable regex (it works only in .NET, not in other regex engines), here you go:

(?<=<<(?:>?[^>])*)0[2349]{1}\-[1-9]{1}[0-9]{6}(?=(?:<?[^<])*>>)
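
For reference, a small C# sketch that exercises this pattern is shown below. The pattern is copied unchanged; `LookaroundDemo`, `FindWrapped`, and the sample text are hypothetical, and the sketch assumes non-nested sections.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Lookbehind: a << earlier with no >> in between.
    // Lookahead: a >> later with no << in between.
    private static readonly Regex Wrapped = new Regex(
        @"(?<=<<(?:>?[^>])*)0[2349]{1}\-[1-9]{1}[0-9]{6}(?=(?:<?[^<])*>>)");

    public static List<string> FindWrapped(string input)
    {
        var found = new List<string>();
        // Each match is the phone number itself; the lookarounds consume nothing.
        foreach (Match m in Wrapped.Matches(input))
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        // Hypothetical sample input.
        string text = "02-1234567 <<03-7654321 and 04-1111111>> <<none>> 09-2222222";
        Console.WriteLine(string.Join(", ", FindWrapped(text)));
    }
}
```

Here every match is a phone number directly, so no group or capture bookkeeping is needed, at the cost of the variable-length lookbehind that ties the pattern to .NET.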
svick
  • If I run a regex 2 times on the input text, isn't that less efficient than 1 regex? – shivesh Jun 18 '10 at 12:27
  • Maybe, but most of the time, readability and maintainability are more important than some minor difference in efficiency. Also, the lookaheads and lookbehinds I used can be quite inefficient, so using two regexes may actually be faster too. – svick Jun 18 '10 at 12:44