Hey all, what would the regex be to extract the inner text of the following:

<br/><span class=""synopsis-view-synopsis"">America's justice system comes under indictment in director <a href='/people/1035' class='actor' style='font-weight:bold'>Norman Jewison</a>'s trenchant film starring <a href='/people/1028' class='actor' style='font-weight:bold'>Al Pacino</a> as upstanding attorney Arthur Kirkland. A hard-line -- and tainted -- judge (<a href='/people/1034' class='actor' style='font-weight:bold'>John Forsythe</a>) stands accused of rape, and Kirkland (<a href='/people/1028' class='actor' style='font-weight:bold'>Al Pacino</a>) has to defend him. Kirkland has a history with the judge, who jailed one of the lawyer's clients on a technicality. When the judge confesses his guilt, Kirkland faces an ethical and legal quandary. </span>

I've tried this:

regex = New System.Text.RegularExpressions.Regex("(?<=""synopsis-view-synopsis""\>)([^<\/span><]+)")

But that only seems to get the first part of the description: "Americ".

Any help would be great! :o)

David

StealthRT
  • What part of "the following" are you trying to match? The inner text? The whole line? Certain tags? Something else? You may also want to ask yourself if this is something that could be made easier with an HTML parser, but I won't make any assumptions at this point... – eldarerathis Apr 04 '11 at 17:37
  • Have a look at the following post: http://stackoverflow.com/questions/181095/regular-expression-to-extract-text-from-html – Kevin Apr 04 '11 at 17:38
  • Also http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Jakub Hampl Apr 04 '11 at 17:39
  • @eldarerathis: The inner text... **America's justice system comes....** – StealthRT Apr 04 '11 at 17:41
  • @JakubHampl has the right idea... [Beware of Zalgo](http://stackoverflow.com/a/1732454/135078) – Kelly S. French Jan 12 '12 at 22:50

3 Answers

I don't see any need for lookaheads or lookbehinds here; just match the whole <span> element and use a capturing group to extract its content. Assuming there will never be any <span> elements inside the one you're matching, this should be all you need:

Regex rgx = new Regex(
    @"<span\s+class=""synopsis-view-synopsis"">(.*?)</span>",
    RegexOptions.IgnoreCase | RegexOptions.Singleline);

foreach (Match m in rgx.Matches(s0))
{
  Console.WriteLine(m.Groups[1].Value);
}

Also, [^<\/span><]+ doesn't do what you probably think it does. What you've got there is a character class that matches any one character except <, /, s, p, a, n, or >. You may have been trying for this:

(?:(?!</span>).)+

...which matches one character at a time, after the lookahead confirms that the character isn't the beginning of the sequence </span>. It's a valid technique, but (as with the lookarounds) I don't think you need anything so fancy here.
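
To make the difference concrete, here is a minimal sketch that runs both patterns against a shortened stand-in for the synopsis. The class name, method names and sample string are made up for illustration; only the two patterns come from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

static class CharClassDemo
{
    // Shortened, hypothetical stand-in for the synopsis text in the question.
    public const string Sample = "America's justice system</span>";

    // The character class from the question: the match ends at the first
    // <, /, s, p, a, n or > it meets, which here is the 'a' in "America".
    public static string WithCharClass(string input) =>
        Regex.Match(input, @"[^<\/span><]+").Value;

    // The lookahead version: consumes one character at a time for as long
    // as the next characters are not the sequence "</span>".
    public static string WithTemperedDot(string input) =>
        Regex.Match(input, @"(?:(?!</span>).)+").Value;

    static void Main()
    {
        Console.WriteLine(WithCharClass(Sample));   // Americ
        Console.WriteLine(WithTemperedDot(Sample)); // America's justice system
    }
}
```

The first result is exactly the truncated "Americ" reported in the question.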

Alan Moore
  • That only seems to gather the first description and not any others following that one. I tried changing your **(.*?)** to **(.+?)** but that does not seem to work. – StealthRT Apr 06 '11 at 12:20
  • **First** description? I don't see anything in the question about matching more than one of anything. – Alan Moore Apr 07 '11 at 01:01
(?=""synopsis-view-synopsis""\>).+(?!<\/span>)

Should probably work. Try using an HTML parser instead!
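
As a rough sketch of the parser route, the snippet below uses System.Xml.Linq from the .NET base class library rather than a real HTML parser. That is a swapped-in technique and it only works under a strong assumption: the fragment must be well-formed XML (normally quoted attributes, every tag closed, no HTML-only entities). Real-world HTML usually needs a dedicated HTML parser. The class name, method name and sample markup are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class SynopsisParser
{
    // Assumption: 'fragment' is well-formed XML. It is wrapped in a dummy
    // <root> element because an XML document needs exactly one root.
    public static List<string> Extract(string fragment)
    {
        XDocument doc = XDocument.Parse("<root>" + fragment + "</root>");
        return doc.Descendants("span")
                  .Where(e => (string)e.Attribute("class") == "synopsis-view-synopsis")
                  .Select(e => e.Value.Trim())   // Value includes the text of nested <a> tags
                  .ToList();
    }

    static void Main()
    {
        // Hypothetical, shortened input with two synopsis spans.
        string html =
            "<br/><span class=\"synopsis-view-synopsis\">A film by " +
            "<a href='/people/1035' class='actor'>Norman Jewison</a>. </span>" +
            "<br/><span class=\"synopsis-view-synopsis\">Another synopsis. </span>";

        foreach (string synopsis in Extract(html))
            Console.WriteLine(synopsis);
    }
}
```

Because the element's Value concatenates all descendant text, the actor links are flattened into plain text, and every matching span is returned without any extra flags.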

Jakub Hampl
  • Didn't seem to work. There were 2 of them and it only displayed the first one and never found the second. – StealthRT Apr 04 '11 at 17:58
  • You didn't mention having two in your question. Change the `.+` to `.+?` and switch on a global flag. That should do the trick. Or use an HTML parser. – Jakub Hampl Apr 04 '11 at 18:14
  • Did not get anything that time around changing **.+** to **.+?** – StealthRT Apr 04 '11 at 18:25
  • Then that's a problem in your code, not in the regexp, or possibly the fact that you are doing something with regexp that you are not supposed to. – Jakub Hampl Apr 04 '11 at 18:27
  • I think the `(?=` at the start of this regex should be `(?<=` instead. – bw_üezi Apr 04 '11 at 18:56

In .NET there are different methods for "match" and "match all":

re.Match(str);    // first match of regex 're' in string 'str'
re.Matches(str);  // all matches of regex 're' in string 'str'
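
A small sketch of the difference, assuming input with ordinary double quotes and reusing the pattern from the first answer. The class name, method names and the two-span sample are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class MatchVsMatches
{
    // Pattern borrowed from the first answer in this thread.
    static readonly Regex Rgx = new Regex(
        @"<span\s+class=""synopsis-view-synopsis"">(.*?)</span>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Match: returns only the first occurrence.
    public static string First(string html) => Rgx.Match(html).Groups[1].Value;

    // Matches: returns every non-overlapping occurrence, in order.
    public static List<string> All(string html)
    {
        var results = new List<string>();
        foreach (Match m in Rgx.Matches(html))
            results.Add(m.Groups[1].Value);
        return results;
    }

    static void Main()
    {
        // Hypothetical input containing two synopsis spans.
        string html =
            "<span class=\"synopsis-view-synopsis\">First</span>\n" +
            "<span class=\"synopsis-view-synopsis\">Second</span>";

        Console.WriteLine(First(html));                  // First
        Console.WriteLine(string.Join(", ", All(html))); // First, Second
    }
}
```

If only the first description ever shows up, calling Match where Matches was intended is one likely cause.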

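Update

Explanation of the regex:

  • (?<=regex) is a positive lookbehind
  • (?!regex) is a negative lookahead
  • .+ matches everything after the lookbehind up to the end of the line; because .+ is greedy, the negative lookahead is only checked at the end of the match, so the closing </span> is included in the result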

Raw Match Pattern:

(?<=""synopsis-view-synopsis""\>).+(?!</span>)

C#.NET Code Example:

using System;
using System.Text.RegularExpressions;
namespace myapp
{
  class Class1
    {
      static void Main(string[] args)
        {
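          // Verbatim string: each "" in the source text is written as """" here,
          // so it agrees with the pattern below, which expects doubled quotes.
          String sourcestring =
            @"<br/><span class=""""synopsis-view-synopsis"""">America's justice... </span>
             <br/><span class=""""synopsis-view-synopsis"""">Canada's justice... </span>";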

          Regex re = new Regex(@"(?<=""""synopsis-view-synopsis""""\>).+(?!</span>)");
          MatchCollection mc = re.Matches(sourcestring);
          int mIdx=0;
          foreach (Match m in mc)
           {
            for (int gIdx = 0; gIdx < m.Groups.Count; gIdx++)
              {
                Console.WriteLine("[{0}][{1}] = {2}", mIdx, re.GetGroupNames()[gIdx], m.Groups[gIdx].Value);
              }
            mIdx++;
          }
        }
    }
}

Matches Found:

[0][0] = America's justice... </span>
[1][0] = Canada's justice... </span>
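
The matches above still end in </span>. One possible variation, sketched here under the assumption that the input uses ordinary double quotes, swaps the greedy .+ for a lazy .+? and the negative lookahead for a positive one, so each match stops just before the closing tag. The class name, method name and sample input are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LazyLookaround
{
    // The lookbehind anchors the match right after the opening tag; .+? then
    // grows one character at a time until the lookahead sees </span>.
    static readonly Regex Rgx = new Regex(
        @"(?<=""synopsis-view-synopsis"">).+?(?=</span>)",
        RegexOptions.Singleline);

    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match m in Rgx.Matches(html))
            results.Add(m.Value);
        return results;
    }

    static void Main()
    {
        // Both spans sit on one line, so a greedy .+ would swallow them together.
        string html =
            "<br/><span class=\"synopsis-view-synopsis\">America's justice... </span>" +
            "<br/><span class=\"synopsis-view-synopsis\">Canada's justice... </span>";

        foreach (string synopsis in Extract(html))
            Console.WriteLine("[" + synopsis + "]");
    }
}
```

RegexOptions.Singleline is there so that a synopsis spanning several lines is still captured in full.
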
bw_üezi