I have developed a regex to use in a .NET WebAPI that gets a date and a control code from a given input already formatted in final format.

I tried regex to avoid using multiple string splits.

I've been using Regex101 to test my expression, and I have one that already works as expected, but I think it's too long for what it does.

Expression:

^([0-9]{2})+([0-9]{2})+([0-9]{2})[0-9](M|F)([0-9]{2})+([0-9]{2})+([0-9]{2})

// Get in format Year, Month, Day, Code(M|F), Year, Month, Day

Input:

7603259M2209058PRT<<<<<<<<<<<8

Do you have any suggestions to simplify it?

MaartenDev
  • 1) The `+` are wrong in your pattern, remove them all, 2) use `\d` to match a digit, and pass `RegexOptions.ECMAScript` option, 3) do not use alternation with single chars, use a character class. `new Regex(@"^(\d{2})(\d{2})(\d{2})\d([MF])(\d{2})(\d{2})(\d{2})", RegexOptions.ECMAScript)` – Wiktor Stribiżew Jan 07 '22 at 16:11
  • How does `2209058` describe "Year, Month, Day"? Year = 22, Month = 09, Day = 058??? – Mathias R. Jessen Jan 07 '22 at 16:12
  • @MathiasR.Jessen year=22, Month=09, Day=05, that's why I'm expecting one more digit before Code(M|F) – João Mendes Jan 07 '22 at 16:13
  • @WiktorStribiżew thanks for the advice, I kept the [0-9] because it's equivalent to \d, I don't really know if the engine transforms \d to [0-9] or how it works. I just would like to have fewer statements. – João Mendes Jan 07 '22 at 16:15
  • In that case @WiktorStribiżew is spot on, you don't need all those open-ended quantifiers. `^([0-9]{2})([0-9]{2})([0-9]{2})[0-9][MF]([0-9]{2})([0-9]{2})([0-9]{2})` should do – Mathias R. Jessen Jan 07 '22 at 16:16
  • `\d` with `RegexOptions.ECMAScript` = `[0-9]`, else, `\d` = `\p{Nd}` – Wiktor Stribiżew Jan 07 '22 at 16:16
  • `\d` != `[0-9]` by default - `\d` will match non-latin numeric digits too, eg. `๔` (4 in Thai numerals) – Mathias R. Jessen Jan 07 '22 at 16:17
  • Wasn't aware thanks for the explanation but in this case I think [0-9] will fit best because I'm not expecting non-latin numerics. This is data provided by a CSV file. – João Mendes Jan 07 '22 at 16:20
  • @JoãoMendes If you read my comments and answer (and see the online demo) you will see that `\d` with `ECMAScript` option does not match any Thai digits. – Wiktor Stribiżew Jan 07 '22 at 16:25

1 Answer

There is one issue with your regex: you quantified the two-digit matching capturing groups with a + quantifier, making them match one or more times. ([0-9]{2})+ matches one or more sequences of any two ASCII digits, while keeping the last captured value in the corresponding group. See Repeating a Capturing Group vs. Capturing a Repeated Group.
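To make the effect of the `+` quantifiers concrete, here is a minimal sketch comparing the first half of the question's pattern with and without them. The input string with two extra leading digits is a made-up example, not data from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// First half of the pattern from the question, with `+` after each capturing group.
var loose = new Regex(@"^([0-9]{2})+([0-9]{2})+([0-9]{2})[0-9](M|F)");
// The same pattern with the `+` quantifiers removed.
var strict = new Regex(@"^([0-9]{2})([0-9]{2})([0-9]{2})[0-9](M|F)");

// Hypothetical malformed input: two extra digits ("12") in front of the date.
var malformed = "127603259M2209058PRT";

var looseMatch = loose.Match(malformed);
Console.WriteLine(looseMatch.Success);         // True: the extra pair is absorbed by the first group
Console.WriteLine(looseMatch.Groups[1].Value); // 76 (only the last repetition stays in Groups[1])
Console.WriteLine(strict.IsMatch(malformed));  // False: exactly six digits plus one are required
```

With the quantifiers in place, the pattern quietly accepts a longer digit run than intended, which is probably not what a fixed-width format calls for.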

You need to remove all + chars from your pattern and then you can also use the following:

  • Use \d to match any digit while passing the RegexOptions.ECMAScript option to the Regex constructor so that it can only match ASCII digits (otherwise, \d will be equal to \p{Nd} and will match any Unicode digits, see \d less efficient than [0-9])
  • Instead of alternation with single chars ((M|F)), use a character class, ([MF]), this is more efficient (see Why is a character class faster than alternation?).
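The difference between `\d` with and without `RegexOptions.ECMAScript` can be checked with a quick sketch like the one below. It uses the Thai digit four mentioned in the comments; the variable names are only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

var thaiFour = "\u0E54"; // THAI DIGIT FOUR, a Unicode decimal digit outside ASCII

// Default options: \d behaves like \p{Nd}, so the Thai digit matches.
bool defaultDigit = Regex.IsMatch(thaiFour, @"\d");
// ECMAScript option: \d is restricted to the ASCII digits 0-9.
bool ecmaDigit = Regex.IsMatch(thaiFour, @"\d", RegexOptions.ECMAScript);
// An explicit character class never matches non-ASCII digits.
bool asciiClass = Regex.IsMatch(thaiFour, "[0-9]");

Console.WriteLine(defaultDigit); // True
Console.WriteLine(ecmaDigit);    // False
Console.WriteLine(asciiClass);   // False
```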

You can use

var pattern = new Regex(@"^(\d{2})(\d{2})(\d{2})\d([MF])(\d{2})(\d{2})(\d{2})", RegexOptions.ECMAScript);

See the .NET regex demo.
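As a rough usage sketch, the seven groups of that pattern can be read back by index. The variable names and the dash-separated output format are only assumptions for illustration; the input is the sample from the question.

```csharp
using System;
using System.Text.RegularExpressions;

var pattern = new Regex(@"^(\d{2})(\d{2})(\d{2})\d([MF])(\d{2})(\d{2})(\d{2})", RegexOptions.ECMAScript);
var match = pattern.Match("7603259M2209058PRT<<<<<<<<<<<8");

string firstDate = "", code = "", secondDate = "";
if (match.Success)
{
    // Groups 1-3: year, month, day of the first date.
    firstDate = $"{match.Groups[1].Value}-{match.Groups[2].Value}-{match.Groups[3].Value}";
    // Group 4: the M/F code (the digit before it is matched but not captured).
    code = match.Groups[4].Value;
    // Groups 5-7: year, month, day of the second date.
    secondDate = $"{match.Groups[5].Value}-{match.Groups[6].Value}-{match.Groups[7].Value}";

    Console.WriteLine($"{firstDate} {code} {secondDate}"); // 76-03-25 M 22-09-05
}
```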

If you want to use an even shorter regex, you may use:

var pattern = new Regex(@"^(?:(\d{2})){3}\d([MF])(?:(\d{2})){3}", RegexOptions.ECMAScript);
var match = pattern.Match("7603259M2209058PRT<<<<<<<<<<<8");
if (match.Success)
{
    Console.WriteLine(match.Groups[1].Captures[0].Value); // => 76
    Console.WriteLine(match.Groups[1].Captures[1].Value); // => 03
    Console.WriteLine(match.Groups[1].Captures[2].Value); // => 25
    Console.WriteLine(match.Groups[2].Value);             // => M
    Console.WriteLine(match.Groups[3].Captures[0].Value); // => 22
    Console.WriteLine(match.Groups[3].Captures[1].Value); // => 09
    Console.WriteLine(match.Groups[3].Captures[2].Value); // => 05
}

See the C# demo and this regex demo.

Note this is possible because .NET Regex allows access to all the captures inside the group stack.

Wiktor Stribiżew
  • Thanks for the answer I applied your suggestion. I would like to find a way to make Regex expression smaller instead of repeating the pattern to create the group "(\d{2})" 6 times – João Mendes Jan 07 '22 at 16:28
  • @JoãoMendes I added another solution with repeated capturing groups, but note it is not quite portable, and less intuitive. Shorter in regex world does not always mean better. – Wiktor Stribiżew Jan 07 '22 at 16:39
  • This was exactly what I was looking for, I had something like this but in Regex101 it was not showing as a captured value so I thought it wouldn't work. I will keep the advice to be careful that it doesn't work everywhere but in my case I think it's the best fit. Thank you very much! – João Mendes Jan 07 '22 at 16:40
  • @JoãoMendes Make sure you test your patterns at a *compatible* online regex tester. Regex101 does not support .NET regex flavor. – Wiktor Stribiżew Jan 07 '22 at 16:51