RegEX style for HTML code

Question

Hey all, what would the regex be for the following:

<br/><span class=""synopsis-view-synopsis"">America's justice system comes under indictment in director <a href='/people/1035' class='actor' style='font-weight:bold'>Norman Jewison</a>'s trenchant film starring <a href='/people/1028' class='actor' style='font-weight:bold'>Al Pacino</a> as upstanding attorney Arthur Kirkland. A hard-line -- and tainted -- judge (<a href='/people/1034' class='actor' style='font-weight:bold'>John Forsythe</a>) stands accused of rape, and Kirkland (<a href='/people/1028' class='actor' style='font-weight:bold'>Al Pacino</a>) has to defend him. Kirkland has a history with the judge, who jailed one of the lawyer's clients on a technicality. When the judge confesses his guilt, Kirkland faces an ethical and legal quandary. </span>

I've tried this:

regex = New System.Text.RegularExpressions.Regex("(?<=""synopsis-view-synopsis""\>)([^<\/span><]+)")

But that only seems to get the first part of the description: Americ

Any help would be great! :o)

David

What part of "the following" are you trying to match? The inner text? The whole line? Certain tags? Something else? You may also want to ask yourself if this is something that could be made easier with an HTML parser, but I won't make any assumptions at this point... — eldarerathis, Apr 04 '11 at 17:37
Have a look at the following post: http://stackoverflow.com/questions/181095/regular-expression-to-extract-text-from-html — Kevin, Apr 04 '11 at 17:38
Also http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Jakub Hampl, Apr 04 '11 at 17:39
@eldarerathis: The inner text... **America's justice system comes....** — StealthRT, Apr 04 '11 at 17:41
@JakubHampl has the right idea... [Beware of Zalgo](http://stackoverflow.com/a/1732454/135078) — Kelly S. French, Jan 12 '12 at 22:50

score 1 · Accepted Answer · answered Apr 04 '11 at 22:11

I don't see any need for lookaheads or lookbehinds here; just match the whole <span> element and use a capturing group to extract its content. Assuming there will never be any <span> elements inside the one you're matching, this should be all you need:

Regex rgx = new Regex(
    @"<span\s+class=""synopsis-view-synopsis"">(.*?)</span>",
    RegexOptions.IgnoreCase | RegexOptions.Singleline);

foreach (Match m in rgx.Matches(s0))
{
  Console.WriteLine(m.Groups[1].Value);
}

Also, [^<\/span><]+ doesn't do what you probably think it does. What you've got there is a character class that matches any one character except <, /, s, p, a, n, or >. You may have been trying for this:

(?:(?!</span>).)+

...which matches one character at a time, after the lookahead confirms that the character isn't the beginning of the sequence </span>. It's a valid technique, but (as with the lookarounds) I don't think you need anything so fancy here.
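To see why the original attempt stopped at "Americ", here is a minimal sketch that runs the three patterns discussed above against a shortened sample. The class name, method names, and sample string are made up for illustration; only the patterns come from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SynopsisPatterns
{
    // 1. Character class from the question: stops at the first <, /, s, p, a, n or >.
    public static string CharClass(string html) =>
        Regex.Match(html, @"(?<=""synopsis-view-synopsis"">)[^<\/span><]+").Value;

    // 2. Lazy capturing group: everything up to the first </span>.
    public static string LazyGroup(string html) =>
        Regex.Match(html, @"<span\s+class=""synopsis-view-synopsis"">(.*?)</span>",
                    RegexOptions.IgnoreCase | RegexOptions.Singleline).Groups[1].Value;

    // 3. Tempered dot: consume a character only if "</span>" does not start there.
    public static string TemperedDot(string html) =>
        Regex.Match(html, @"(?<=""synopsis-view-synopsis"">)(?:(?!</span>).)+",
                    RegexOptions.Singleline).Value;

    public static void Main()
    {
        // Shortened, illustrative sample (not the full text from the question).
        string html = @"<br/><span class=""synopsis-view-synopsis"">America's justice system, with <a href='/people/1028'>Al Pacino</a>.</span>";

        Console.WriteLine(CharClass(html));    // prints "Americ" (the next 'a' is excluded)
        Console.WriteLine(LazyGroup(html));    // inner HTML, including the <a> element
        Console.WriteLine(TemperedDot(html));  // same inner HTML as the lazy group
    }
}
```

Patterns 2 and 3 return the inner markup as-is, so any nested <a> tags are still part of the result and would need a separate cleanup step if only plain text is wanted.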

That only seems to gather the first description and not any others following that one. I tried changing your **(.*?)** to **(.+?)** but that does not seem to work. — StealthRT, Apr 06 '11 at 12:20
**First** description? I don't see anything in the question about matching more than one of anything. — Alan Moore, Apr 07 '11 at 01:01

score 0 · Answer 2 · edited May 23 '17 at 11:55


(?=""synopsis-view-synopsis""\>).+(?!<\/span>)

Should probably work. Try using an HTML parser instead!

Jakub Hampl · answered Apr 04 '11 at 17:54

Didn't seem to work. There were 2 of them and it only displayed the first one and never found the second. – StealthRT Apr 04 '11 at 17:58
You didn't mention having two in your question. Change the `.+` to `.+?` and switch on a global flag. That should do the trick. Or use an HTML parser. – Jakub Hampl Apr 04 '11 at 18:14
Did not get anything that time around changing **.+** to **.+?** – StealthRT Apr 04 '11 at 18:25
Then that's a problem in your code, not in the regexp, or possibly the fact that you are doing something with regexp that you are not supposed to. – Jakub Hampl Apr 04 '11 at 18:27
I think the `(?=` at the start of this regex should be `(?<=` instead. – bw_üezi Apr 04 '11 at 18:56

bw_üezi · Answer 3 · 2011-04-04T19:10:02.280

In .NET there are different methods for "match" and "match all":

re.Match(str);    // first match of regex 're' in string 'str'
re.Matches(str);  // all matches of regex 're' in string 'str'
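A small sketch of the difference, assuming two synopsis spans as described in the comments. The pattern is borrowed from the accepted answer; the class name, helper method, and sample text are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchVsMatches
{
    static readonly Regex Re = new Regex(
        @"<span\s+class=""synopsis-view-synopsis"">(.*?)</span>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Match: only the first hit in the input.
    public static string First(string html) => Re.Match(html).Groups[1].Value;

    // Matches: every non-overlapping hit in the input.
    public static List<string> All(string html)
    {
        var results = new List<string>();
        foreach (Match m in Re.Matches(html))
            results.Add(m.Groups[1].Value);
        return results;
    }

    public static void Main()
    {
        string html =
            "<span class=\"synopsis-view-synopsis\">America's justice... </span>\n" +
            "<span class=\"synopsis-view-synopsis\">Canada's justice... </span>";

        Console.WriteLine(First(html));       // first synopsis only
        foreach (string s in All(html))       // both synopses
            Console.WriteLine(s);
    }
}
```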

Update

Explanation of the regex:

(?<=regex) is a positive lookbehind
(?!regex) is a negative lookahead
.+ matches the text after the lookbehind (greedily, so the trailing </span> is included, as the output below shows)

Raw Match Pattern:

(?<=""synopsis-view-synopsis""\>).+(?!</span>)

C#.NET Code Example:

using System;
using System.Text.RegularExpressions;
namespace myapp
{
  class Class1
    {
      static void Main(string[] args)
        {
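          // Verbatim string (@"...") so it can span two lines; "" stands for one literal quote.
          String sourcestring =
            @"<br/><span class=""synopsis-view-synopsis"">America's justice... </span>
             <br/><span class=""synopsis-view-synopsis"">Canada's justice... </span>";

          // In a verbatim string, "" in the pattern matches a single " in the HTML.
          Regex re = new Regex(@"(?<=""synopsis-view-synopsis""\>).+(?!</span>)");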
          MatchCollection mc = re.Matches(sourcestring);
          int mIdx=0;
          foreach (Match m in mc)
           {
            for (int gIdx = 0; gIdx < m.Groups.Count; gIdx++)
              {
                Console.WriteLine("[{0}][{1}] = {2}", mIdx, re.GetGroupNames()[gIdx], m.Groups[gIdx].Value);
              }
            mIdx++;
          }
        }
    }
}

Matches Found:

[0][0] = America's justice... </span>
[1][0] = Canada's justice... </span>
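Note that both matches above still end with `</span>`. The greedy `.+` runs to the end of the line, and the negative lookahead `(?!</span>)` succeeds trivially there. If the goal is the inner text only, one possible adjustment is a lazy quantifier with a positive lookahead. This is a sketch under that assumption, with an illustrative class name and sample string, not a replacement for a real HTML parser:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Lazy .+? stops at the first position that is followed by </span>.
    static readonly Regex Re =
        new Regex(@"(?<=""synopsis-view-synopsis"">).+?(?=</span>)");

    public static List<string> InnerTexts(string source)
    {
        var results = new List<string>();
        foreach (Match m in Re.Matches(source))
            results.Add(m.Value);
        return results;
    }

    public static void Main()
    {
        string source =
            "<br/><span class=\"synopsis-view-synopsis\">America's justice... </span>\n" +
            "<br/><span class=\"synopsis-view-synopsis\">Canada's justice... </span>";

        // Brackets make the trailing space visible; no </span> in the output.
        foreach (string text in InnerTexts(source))
            Console.WriteLine("[" + text + "]");
    }
}
```

Without `RegexOptions.Singleline`, `.` does not match a newline, so a synopsis that spans several lines would need that option added.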

@StealthRT see update. used this [regex tester](http://www.myregextester.com/index.php) — bw_üezi, Apr 04 '11 at 18:42
