I am taking over a data mining project written in C# that parses raw text files in order to store useful data in databases.

There is no problem for now; everything works out of the box, but I don't understand the syntax of some of the regular expressions.

Why does the expression Déposé et enregistré le (?<Registred>.+?)\s*(\r\n)

match the string Déposé et enregistré le 16/09/2016?

I expected the regular expression to look like Déposé et enregistré le ([0-9]{2}\/[0-9]{2}\/[0-9]{4}) in order to match my string.

What confuses me is the (?<Registred>.+?) part, which in my opinion shouldn't match a date like 16/09/2016.

Here is a sample of the code that matches the string:

var results = new List<RegexResult>();
String regexS = r.RegexValue;

try
{
    // Abort any single match attempt that runs longer than 3 seconds.
    var regex = new System.Text.RegularExpressions.Regex(regexS, RegexOptions.None, TimeSpan.FromSeconds(3));
    var matchCollection = regex.Matches(data.Data);

    if (matchCollection.Count > 0)
    {
        int occurenceCounter = 0;
        foreach (Match match in matchCollection)
        {
            string[] capturedGroup = regex.GetGroupNames();
            foreach (string groupName in capturedGroup)
            {
                string resultValue = match.Groups[groupName].Value.Trim();
                if (groupName != "0")
                {
                    results.Add(new RegexResult(data.Id, r, resultValue, groupName, occurenceCounter));
                }
                log.Info("RawData Id : {0} | Regex Id : {1} | groupName {2} : {3}", data.Id, r.Id, groupName, resultValue);
            }
            occurenceCounter++;
        }
    }
}
catch (RegexMatchTimeoutException e)
{
    log.Error("RegexMatchTimeoutException for Id {0} and regex {1}", data.Id, regexS, e);
}            

return results;

Any ideas?

MadJlzz
  • Actually, that will only match if there is a linebreak after the date because of `\r\n`. The dot matches any char but a newline. `+?` matches 1 or more occurrences, but as few as possible. Are you just asking for a regex explanation? – Wiktor Stribiżew Sep 26 '16 at 15:41
  • Thanks a lot for the reference. I've added it to my favorites so I won't duplicate this kind of topic. See the answer from @dan1111, which answers my question. – MadJlzz Sep 27 '16 at 08:20
  • Well, dan just replicated http://regex101.com. – Wiktor Stribiżew Sep 27 '16 at 08:25

1 Answer

This:

(?<Registred>.+?)

is a named capture group. The ?<Registred> part is not part of what gets matched; it only gives the group a name, which can be used to refer to the text captured by the parentheses.

It matches the same text as the following, which uses the standard (unnamed) capture group syntax:

(.+?)

So it simply matches one or more characters (any character except a newline), with the non-greedy quantifier +? making it match as few characters as possible.

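So the pattern will match any text that contains "Déposé et enregistré le " followed by at least one character, optional whitespace, and then a \r\n line break. That is why it matches your date: .+? accepts any characters, digits and slashes included.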
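
To see this behaviour in isolation, here is a minimal sketch that compares the pattern from the question with the stricter date pattern the asker expected. The group name Date and the three sample strings (including the word "greffe") are made up for illustration, and the snippet uses top-level statements, so it assumes C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the question; the group name "Registred" is kept as spelled there.
var loose = new Regex(@"Déposé et enregistré le (?<Registred>.+?)\s*(\r\n)");

// Stricter variant from the question; the group name "Date" is hypothetical.
// In .NET the '/' character does not need to be escaped.
var strict = new Regex(@"Déposé et enregistré le (?<Date>[0-9]{2}/[0-9]{2}/[0-9]{4})");

// Hypothetical inputs for this sketch.
string withNewline = "Déposé et enregistré le 16/09/2016\r\n";
string withoutNewline = "Déposé et enregistré le 16/09/2016";
string notADate = "Déposé et enregistré le greffe\r\n";

// .+? grows one character at a time until \s*(\r\n) can match what follows.
Match m = loose.Match(withNewline);
Console.WriteLine(m.Success);                                   // True
Console.WriteLine(m.Groups["Registred"].Value);                 // 16/09/2016

// Without a trailing \r\n the loose pattern cannot match at all.
Console.WriteLine(loose.IsMatch(withoutNewline));               // False

// The loose pattern also accepts text that is not a date.
Console.WriteLine(loose.Match(notADate).Groups["Registred"].Value); // greffe

// The strict pattern rejects non-dates and does not require a line break.
Console.WriteLine(strict.IsMatch(notADate));                    // False
Console.WriteLine(strict.Match(withoutNewline).Groups["Date"].Value); // 16/09/2016
```

Which pattern is preferable depends on the data: the loose one captures whatever follows the label up to the line break, while the strict one only accepts the dd/mm/yyyy shape.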

  • OK, everything is clearer now. I think they just didn't understand what they were doing, so the expressions they wrote are kind of useless for the strings they wanted to extract. I'm going to standardize all of this. Thanks. – MadJlzz Sep 27 '16 at 08:27