27

Say I have a regex matching a hexadecimal 32-bit number:

([0-9a-fA-F]{1,8})

When I construct a regex where I need to match this multiple times, e.g.

(?<from>[0-9a-fA-F]{1,8})\s*:\s*(?<to>[0-9a-fA-F]{1,8})

Do I have to repeat the subexpression definition every time, or is there a way to "name and reuse" it?

I'd imagine something like (warning, invented syntax!)

(?<from>{hexnum=[0-9a-fA-F]{1,8}})\s*:\s*(?<to>{=hexnum})

where hexnum= would define the subexpression "hexnum", and {=hexnum} would reuse it.

Since I already learnt it matters: I'm using .NET's System.Text.RegularExpressions.Regex, but a general answer would be interesting, too.

johnnyRose
peterchen
  • You should play around with an online tool and its saved examples: [RegEx Online](http://gskinner.com/RegExr/) – vellotis Nov 20 '11 at 20:00
  • I'm already using [expresso](http://www.ultrapico.com/Expresso.htm), thanks anyway for the suggestion :) – peterchen Nov 20 '11 at 20:27

6 Answers

18

RegEx Subroutines

When you want to use a sub-expression multiple times without rewriting it, you can group it then call it as a subroutine. Subroutines may be called by name, index, or relative position.

Subroutines are supported by PCRE, Perl, Ruby, PHP, Delphi, R, and others. Unfortunately, the .NET regex engine does not support them, but there are some PCRE libraries for .NET that you can use instead (such as https://github.com/ltrzesniewski/pcre-net).

Syntax

Here's how subroutines work: let's say you have a sub-expression [abc] that you want to repeat three times in a row.

Standard RegEx
Any: [abc][abc][abc]

Subroutine, by Name
Perl:     (?'name'[abc])(?&name)(?&name)
PCRE: (?P<name>[abc])(?P>name)(?P>name)
Ruby:   (?<name>[abc])\g<name>\g<name>

Subroutine, by Index
Perl/PCRE: ([abc])(?1)(?1)
Ruby:          ([abc])\g<1>\g<1>

Subroutine, by Relative Position
Perl:     ([abc])(?-1)(?-1)
PCRE: ([abc])(?-1)(?-1)
Ruby:   ([abc])\g<-1>\g<-1>

Subroutine, Predefined
This defines a subroutine without executing it.
Perl/PCRE: (?(DEFINE)(?'name'[abc]))(?P>name)(?P>name)(?P>name)

Examples

Matches a valid IPv4 address string, from 0.0.0.0 to 255.255.255.255:
((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.(?1)\.(?1)\.(?1)

Without subroutines:
((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))

And to solve the original posted problem:
(?<from>(?P<hexnum>[0-9a-fA-F]{1,8}))\s*:\s*(?<to>(?P>hexnum))
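
Since the built-in .NET engine has no subroutine calls, the IPv4 example could be approximated there by defining the octet once as a string and splicing it in four times. This is only a minimal sketch: the names `Ipv4Sketch`, `Octet` and `Address` are hypothetical, and the `^`/`$` anchors are an added assumption so that the whole input must be an address.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Ipv4Sketch
{
    // One octet (0-255): the same alternation as above, as a non-capturing group.
    private const string Octet = "(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]?[0-9])";

    // Assumption: anchors added so the entire input has to be an address.
    public static readonly Regex Address =
        new Regex($@"^{Octet}\.{Octet}\.{Octet}\.{Octet}$");

    public static void Main()
    {
        Console.WriteLine(Address.IsMatch("192.168.0.1")); // True
        Console.WriteLine(Address.IsMatch("256.1.1.1"));   // False
    }
}
```

Unlike `(?1)`, the reuse happens when the pattern string is built, so the regex engine still sees the fully expanded expression.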

More Info

http://regular-expressions.info/subroutine.html
http://regex101.com/

Community
Beejor
  • Ssshhhh... people want to believe that XML/HTML is not parsable because they don't know regexes can do this. – Nakilon Nov 19 '18 at 09:31
4

.NET regex does not support pattern recursion (subroutine calls). In Ruby and PHP/PCRE you can use (?<from>(?<hex>[0-9a-fA-F]{1,8}))\s*:\s*(?<to>(\g<hex>)), where hex is a "technical" named capturing group whose name should not occur elsewhere in the main pattern. In .NET, you may instead define the block(s) as separate variables and then use them to build a dynamic pattern.

Starting with C# 6, you may use an interpolated string literal that looks very much like a PCRE/Onigmo subpattern recursion, but is actually cleaner and avoids any clash with a group that is named identically to the "technical" capturing group:

C# demo:

using System;
using System.Text.RegularExpressions;

public class Test
{
    public static void Main()
    {
        var block = "[0-9a-fA-F]{1,8}";
        var pattern = $@"(?<from>{block})\s*:\s*(?<to>{block})";
        Console.WriteLine(Regex.IsMatch("12345678  :87654321", pattern));
    }
}

The $@"..." is a verbatim interpolated string literal, in which a backslash is a literal character, so regex escapes such as \s need no doubling. Write a literal { as {{ and a literal } as }} (e.g. $@"(?:{block}){{5}}" to repeat a block 5 times).
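
To make the brace escaping concrete, here is a small sketch that combines an interpolated block with a literal quantifier. The scenario (six colon-separated hex bytes) and the names `BraceEscapeSketch`, `BuildPattern` and `hexByte` are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BraceEscapeSketch
{
    public static string BuildPattern()
    {
        // Ordinary string: the braces of {2} need no escaping here.
        var hexByte = "[0-9a-fA-F]{2}";
        // {hexByte} is interpolated; {{5}} yields the literal quantifier {5}.
        return $@"^(?:{hexByte}:){{5}}{hexByte}$";
    }

    public static void Main()
    {
        var pattern = BuildPattern();
        Console.WriteLine(pattern); // ^(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}$
        Console.WriteLine(Regex.IsMatch("00:1A:2b:3C:4d:5E", pattern)); // True
    }
}
```

Printing the built pattern, as done here, is a cheap way to check that the braces ended up where intended.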

For older C# versions, use string.Format:

var pattern = string.Format(@"(?<from>{0})\s*:\s*(?<to>{0})", block);

as is suggested in Mattias's answer.

Wiktor Stribiżew
4

Why not do something like this? It's not really shorter, but a bit more maintainable.

String.Format(@"(?<from>{0})\s*:\s*(?<to>{0})", "[0-9a-fA-F]{1,8}");

If you want more self-documenting code, I would assign the number regex string to a properly named constant.
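
A minimal sketch of the named-constant variant, assuming the hexadecimal sub-pattern from the question; `HexRange`, `HexNumber` and `Pattern` are hypothetical names chosen for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HexRange
{
    // The reusable sub-pattern: a hexadecimal number of 1 to 8 digits.
    private const string HexNumber = "[0-9a-fA-F]{1,8}";

    // {0} is replaced by HexNumber in both places.
    public static readonly Regex Pattern = new Regex(
        string.Format(@"(?<from>{0})\s*:\s*(?<to>{0})", HexNumber));

    public static void Main()
    {
        Match m = Pattern.Match("1A2B : ff");
        Console.WriteLine(m.Groups["from"].Value); // 1A2B
        Console.WriteLine(m.Groups["to"].Value);   // ff
    }
}
```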

Mattias Wadman
3

If I am understanding your question correctly, you want to reuse certain patterns to construct a bigger pattern?

string f = @"fc\d+/";
string e = @"\d+";
Regex regexObj = new Regex(f+e);

Other than this, using backreferences will only help if you are trying to match the exact same string that you have previously matched somewhere in your regex.

For example,

/\b([a-z])\w+\1\b/

matches only "text" and "spaces" in the following text:

This is a sample text which is not the title since it does not end with 2 spaces.

FailedDev
1

There is no such predefined class. I think you can simplify it using the ignore-case option, e.g.:

(?i)(?<from>[0-9a-f]{1,8})\s*:\s*(?<to>[0-9a-f]{1,8})
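
In C#, the same idea might be sketched with RegexOptions.IgnoreCase in place of the inline (?i) flag. The class is restricted to a-f here because hex digits stop at f, and the names `IgnoreCaseSketch`, `Hex` and `Pattern` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IgnoreCaseSketch
{
    // Lower-case class only; the option below makes it match A-F as well.
    private const string Hex = "[0-9a-f]{1,8}";

    // RegexOptions.IgnoreCase plays the role of the inline (?i) flag.
    public static readonly Regex Pattern = new Regex(
        $@"(?<from>{Hex})\s*:\s*(?<to>{Hex})", RegexOptions.IgnoreCase);

    public static void Main()
    {
        Match m = Pattern.Match("DEADBEEF:cafe");
        Console.WriteLine(m.Success);               // True
        Console.WriteLine(m.Groups["from"].Value);  // DEADBEEF
    }
}
```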
Kirill Polishchuk
-2

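To refer back to a named capture group, use this syntax: \k<name> or \k'name'. Note that this is a backreference: it matches the same text the group captured, not the same pattern.

So the following matches only when both numbers are written identically: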

(?<from>[0-9a-fA-F]{1,8})\s*:\s*\k<from>

More info: http://www.regular-expressions.info/named.html
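
Keep in mind that \k<from> is a backreference, so it has to match the exact text that the from group captured rather than any hexadecimal number. The sketch below contrasts it with repeating the sub-pattern; the names are hypothetical and the `^`/`$` anchors are added only to make the difference easy to see.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceSketch
{
    private const string Hex = "[0-9a-fA-F]{1,8}";

    // \k<from> must match the same text that "from" captured.
    public static readonly Regex SameText =
        new Regex($@"^(?<from>{Hex})\s*:\s*\k<from>$");

    // Repeating the sub-pattern lets "to" be any hex number.
    public static readonly Regex SamePattern =
        new Regex($@"^(?<from>{Hex})\s*:\s*(?<to>{Hex})$");

    public static void Main()
    {
        Console.WriteLine(SameText.IsMatch("ABCD:ABCD"));    // True
        Console.WriteLine(SameText.IsMatch("ABCD:1234"));    // False
        Console.WriteLine(SamePattern.IsMatch("ABCD:1234")); // True
    }
}
```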

shanethehat
Zoli