Regex: how to exclude a keyword that may follow the captured text

Question

I'm reading a list line by line and using a regex in C# to capture the fields:

Input line 1: Type: eBook Year: 1990 Title: This is ebook 1 ISBN:15465452 Pages: 100 Authors: Cendric, Paul

Input line 2: Type: Movie Year: 2016 Title: This is movie 1 Authors: Pepe Giron ; Yamasaki Suzuki Length: 4500 Media Type: DVD

string pattern = @"(?:(Type: )(?<type>\w+)) *(?:(Year: )(?<year>\d{4})) *(?:(Title: )(?<title>[^ISBN]*))(?:(ISBN:) *(?<ISBN>\d*))* *(?:(Pages: )(?<pages>\d*))* *(?:(Authors: )(?<author1>[\w ,]*)) *;* *(?<author2>[\w ,]*) *(?:(Length: )(?<length>\d*))* *(?:Media Type: )*(?<discType>[\w ,]*)";

MatchCollection matches = Regex.Matches(line, pattern);

If the input line contains "Length: ", I want the capture of the author's surname to stop before the word Length.

If I use (?:(Length: )(?<length>\d*))*, "Length" is appended to the second author's surname in match.Groups["author2"].Value. If I use (?:(Length: )(?<length>\d*))+ instead, I get no matches for the first line.

Can you please give me some guidance? Thank you, Sergio

`[^ISBN]*` is wrong. It stops at any one of the letters I, S, B, N, not at the word ISBN - xanatos, May 20 '18 at 16:38
And in general your solution seems quite brittle, unless you are very sure of the format of the line. It would probably be better to search for the tags (Type:, Year:, ...) "manually" (with `string.IndexOf`) instead of fighting against the vagaries of regexes - xanatos, May 20 '18 at 16:43
I agree with @xanatos: these look like name/value pairs. I would write code to parse them into that. - Kevin, May 20 '18 at 16:49
What are the rules? What makes `Media Type` different from `Suzuki Length`? - Eser, May 20 '18 at 16:56
@Eser `keyword:` (keyword colon) is a reserved word (e.g. Media Type:)... That is brittle in itself. Whoever wrote the format should be kicked down the hill. - xanatos, May 20 '18 at 17:04

score 1 · Answer 1 · answered May 20 '18 at 17:08

Using a full regex for something as fuzzy as the format you have is a reliable way to hurt yourself. As @Kevin wrote, you should look for the keys and extract the values.

My proposal is to look for those keys and split the string before and after them. There is a nifty, quirky (its behavior even changed between .NET 1.1 and .NET 2.0), little-known feature of Regex called Regex.Split(). We could try to use it :-)

string pattern = @"(?<=^| )(Type: |Year: |Title: |ISBN:|Pages: |Authors: |Length: |Media Type: )";
var rx = new Regex(pattern);
string[] parts = rx.Split(line);

Now parts is an array in which each element holding a key is followed by the element holding its value (Regex.Split keeps the keys in the result because they are inside a capturing group). Regex.Split can also add an empty element at the beginning of the array.

string type = null, title = null, mediaType = null;
int? year = null, length = null; // initialized so they can be read after the loop
string[] authors = new string[0];


// The parts[0] == string.Empty ? 1 : 0 is caused by the "strangeness" of Regex.Split
// that can add an empty element at the beginning of the array
for (int i = parts[0] == string.Empty ? 1 : 0; i < parts.Length; i += 2)
{
    string key = parts[i].TrimEnd();
    string value = parts[i + 1].Trim();
    Console.WriteLine("[{0}|{1}]", key, value);

    switch (key)
    {
        case "Type:":
            type = value;
            break;
        case "Year:":
            {
                int temp;
                if (int.TryParse(value, out temp))
                {
                    year = temp;
                }
            }
            break;
        case "Title:":
            title = value;
            break;
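        case "Authors:":
            {
                // The string[] overload compiles on .NET Framework as well as .NET Core
                authors = value.Split(new[] { " ; " }, StringSplitOptions.None);
            }
            break;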
        case "Length:":
            {
                int temp;
                if (int.TryParse(value, out temp))
                {
                    length = temp;
                }
            }
            break;
        case "Media Type:":
            mediaType = value;
            break;
    }
}
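For reference, here is a minimal self-contained sketch of the same Regex.Split idea that collects every key/value pair into a dictionary instead of a switch. The names `RecordParser` and `ParseFields` are made up for this illustration and are not part of any library. It assumes the keys listed in the pattern above are the only ones that occur.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named here only for illustration.
public static class RecordParser
{
    // The keys sit inside a capturing group, so Regex.Split keeps them in its output.
    private static readonly Regex KeyRx = new Regex(
        @"(?<=^| )(Type: |Year: |Title: |ISBN:|Pages: |Authors: |Length: |Media Type: )");

    public static Dictionary<string, string> ParseFields(string line)
    {
        var fields = new Dictionary<string, string>();
        string[] parts = KeyRx.Split(line);

        // parts[0] is whatever precedes the first key (an empty string when the
        // line starts with a key), so key/value pairs begin at index 1.
        for (int i = 1; i + 1 < parts.Length; i += 2)
        {
            string key = parts[i].Trim().TrimEnd(':'); // "Media Type: " -> "Media Type"
            fields[key] = parts[i + 1].Trim();
        }
        return fields;
    }
}
```

Calling `RecordParser.ParseFields(line)` on the second sample line should give an "Authors" entry that ends at "Suzuki", with "Length" and "Media Type" stored as separate entries. Numeric fields can then be converted with `int.TryParse` as in the loop above.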

score 1 · Answer 2 · answered May 21 '18 at 00:42

After all, @xanatos is right. An overcomplicated regex that is hard to maintain and error-prone may not serve you well in the long run.

But to answer your question, your regex can be fixed with a tempered greedy token*, that is, by not allowing Length: inside the authors' pattern:

(?:(?:(?!Length: )[\w ,])*)

* The linked description uses a . in the tempered greedy token, but here it is useful to restrict the range of allowed characters further.

Arguably, this should be added to both the author1 and the author2 parts.

The final pattern then looks like this:

(?:(Type: )(?<type>\w+)) *(?:(Year: )(?<year>\d{4})) *(?:(Title: )(?<title>[^ISBN]*))(?:(ISBN:) *(?<ISBN>\d*))* *(?:(Pages: )(?<pages>\d*))* *(?:(Authors: )(?<author1>(?:(?:(?!Length: )[\w ,])*) *)) *;* *(?<author2>(?:(?:(?!Length: )[\w ,])*) *)(?:(Length: )(?<length>\d*))* *(?:Media Type: )*(?<discType>[\w ,]*)

Demo
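To see the tempered greedy token in isolation, here is a reduced sketch that covers only the Authors and Length fields rather than the full pattern. The class and method names (`AuthorsDemo`, `MatchAuthors`) are made up for this illustration, and the single `authors` group (with `;` added to the character class) is a simplification of the author1/author2 split above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, reduced to the two fields that interact.
public static class AuthorsDemo
{
    // (?!Length: ) is checked before every character, so the authors group
    // can never run into the "Length: " key. The Length part stays optional.
    private static readonly Regex AuthorsRx = new Regex(
        @"Authors: (?<authors>(?:(?!Length: )[\w ,;])*)(?:Length: (?<length>\d+))?");

    public static Match MatchAuthors(string line)
    {
        return AuthorsRx.Match(line);
    }
}
```

With the second sample line, `MatchAuthors(line).Groups["authors"].Value.Trim()` should end at "Suzuki" and the `length` group should hold the digits. With the first sample line the `length` group simply does not participate, so check `Groups["length"].Success` before using it.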
