I have 3 strings from which I want to extract the movie title, if possible with a single regular expression:

<title>Airplane! (1980)</title>    

<title>&#x22;24&#x22; (2001)</title>    

<title>&#x22;Agents of S.H.I.E.L.D.&#x22; The Magical Place (2014)</title>

My best shot so far is this one:

<title>(&#x22;)?(.*?)(&#x22;)?.*?\((\d{4})\).*?</title>

Works fine for "Agents of S.H.I.E.L.D." and "24" but not for "Airplane!".

What am I doing wrong?

In case it is not clear: the regular expression is called from within a C# program, and I'm using `Regex`.

Jens Borrisholt
  • Airplane close tag is missing `/` – alpha bravo Nov 07 '14 at 17:37
  • I'm not sure what you mean @alphabravo – Jens Borrisholt Nov 07 '14 at 17:40
  • He means your sample Airplane title tag should end with `</title>`; yours is `< title>`. It's not your regex, it's your sample data. – hometoast Nov 07 '14 at 17:41
  • That's just a typo, makes no difference. It is so in the real html. Just for the record: I've just edited my question – Jens Borrisholt Nov 07 '14 at 17:42
  • Why are you using Regular Expressions? XML is not a regular language. You should be using an XML library. Or if it's HTML, then you should use something like the [HtmlAgilityPack](http://htmlagilitypack.codeplex.com/). – mason Nov 07 '14 at 17:45
  • It is HTML which I get from IMDb, so I cannot change the format – Jens Borrisholt Nov 07 '14 at 17:47
  • I didn't say you should change the format. I said you should use a library to extract the data. In your case, the HtmlAgilityPack. Regular expressions should *not* be used to extract information from HTML. – mason Nov 07 '14 at 17:49
  • Mason, yes, regex for HTML is "bad", but in this case he's just parsing one tag. Let's presume he can get that tag reliably. – hometoast Nov 07 '14 at 17:50
  • You should also use [IMDB's API](http://stackoverflow.com/questions/1966503/does-imdb-provide-an-api) instead of retrieving HTML. It'll be easier to work with as it returns XML instead of HTML. HTML is not a format for passing data programmatically, it's a markup language for displaying content visually. XML is however a well recognized format for passing data between applications. – mason Nov 07 '14 at 17:51
  • @hometoast I didn't say it was impossible. I said it [shouldn't be done](http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not), especially when there are much better options out there. – mason Nov 07 '14 at 17:53
  • This is just a small project for myself. The main purpose is to learn some regular expressions. So thank you for your advice about doing it differently, which I really appreciate. But I would also like an answer to my question, WITHIN the scope that Regex and HTML are the only solution in the world. – Jens Borrisholt Nov 07 '14 at 17:54
  • Using an HTML parser would still require you to use an RE to parse the single text node data which is 99% of this task – Alex K. Nov 07 '14 at 17:55
  • Frankly, your regex and your test data all check out in regexlib, regexbuddy, and regexhero. I find no error. – hometoast Nov 07 '14 at 17:56
  • Strange because I am using regexhero. Expression: (")?(.*?)(")?.*?\((\d{4})\).*? Test data: Airplane! (1980) result two groups: "2: " "4: 1980" – Jens Borrisholt Nov 07 '14 at 17:57
  • @AlexK. Yes, but many of them offer better support for the intricacies of HTML, in the case of malformed HTML. Leave it to the experts-no need to reinvent the wheel. – mason Nov 07 '14 at 17:58
  • I see the police were here and didn't like my question, because they didn't read the comments – Jens Borrisholt Nov 14 '14 at 07:52

1 Answer

Decode the HTML entities first, then use an RE for: start of line => opening `<title>` tag => optional `"` => read until `"` or `(nnnn)`

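// Requires: using System.Net; using System.Text.RegularExpressions;
// Decode entities first so that &#x22; becomes a plain double quote.
titles = WebUtility.HtmlDecode(titles);

// Group 1 lazily reads the title up to the closing quote or the "(nnnn)" year.
foreach (Match match in Regex.Matches(titles,
         @"^\s*<title>\s*\""*(.*?)(\""|\(\d{4}\))", RegexOptions.Multiline | RegexOptions.IgnoreCase))
{
    // Every Match yielded by Regex.Matches is successful, so no Success check is needed.
    // Trim drops the space left before "(1980)" in unquoted titles such as "Airplane! ".
    string name = match.Groups[1].Value.Trim();
}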
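
On the "what am I doing wrong" part: in a backtracking engine, a lazy group such as `(.*?)` that is followed only by optional or lazy tokens can satisfy the match while capturing nothing, because the later `.*?` is free to consume the title instead. The sketch below is one possible variation, not the pattern above: it forces the title group to end at a required delimiter (the closing quote, or the whitespace before the year) and also captures the year. The names `TitleParser`, `TitlePattern` and `TryParse` are hypothetical, and it assumes each input holds a single `<title>` element whose unquoted title contains no `(`.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class TitleParser
{
    // Two alternatives share the group name "title" (.NET allows this):
    //   1. a quoted title, optionally followed by episode text, then the year
    //   2. an unquoted title that runs up to the whitespace before the year
    private static readonly Regex TitlePattern = new Regex(
        @"<title>\s*(?:""(?<title>[^""]*)""[^<(]*|(?<title>[^""(<]+?)\s*)\((?<year>[0-9]{4})\)\s*</title>",
        RegexOptions.IgnoreCase);

    public static bool TryParse(string html, out string title, out int year)
    {
        title = string.Empty;
        year = 0;

        // Decode entities first so that &#x22; becomes a plain double quote.
        Match m = TitlePattern.Match(WebUtility.HtmlDecode(html));
        if (!m.Success)
        {
            return false;
        }

        title = m.Groups["title"].Value.Trim();
        year = int.Parse(m.Groups["year"].Value);
        return true;
    }
}
```

Calling `TitleParser.TryParse("<title>Airplane! (1980)</title>", out var title, out var year)` should yield `Airplane!` and `1980`; the two quoted samples from the question should yield `24` and `Agents of S.H.I.E.L.D.` respectively. A title that itself contains parentheses would need a looser pattern.
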
Alex K.