I've been struggling with this for a while:

var matches = Regex.Matches("<h2>hello world</h2>",
    @"<(?<tag>[^\s/>]+)(?<innerHtml>.*)(?<closeTag>[^\s>]+)>",
    RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Multiline);

string tag = matches[0].Groups["tag"].Value; // "h2"
string innerHtml = matches[0].Groups["innerHtml"].Value; // ">hello world</h"
string closeTag = matches[0].Groups["closeTag"].Value; // "2"

As can be seen, tag works as expected, while innerHtml and closeTag do not. Any advice? Thanks.

Update

The input string may vary. Another scenario is "<div class='myclass'><h2>hello world</h2></div>".

Eric Herlitz

2 Answers

Try matching the > and </ outside of the capture groups, like this:

var matches = Regex.Matches("<h2>hello world</h2>",
    @"<(?<tag>[^\s/>]+)>(?<innerHtml>.*)</(?<closeTag>[^\s>]+)>",
    RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Multiline);

Update

A more specific example that should be a little more flexible:

var matches = Regex.Matches(
    "<div class='myclass'><h2>hello world</h2></div>",
    @"<(?<tag>[^\s>]+)               #Opening tag
        \s*(?<attributes>[^>]*)\s*>  #Attributes inside tag (optional)
      (?<innerHtml>.*)               #Inner Html
      </(?<closeTag>\1)>             #Closing tag, must match opening tag",
    RegexOptions.IgnoreCase | 
    RegexOptions.Compiled | 
    RegexOptions.Multiline |
    RegexOptions.IgnorePatternWhitespace);

string tag = matches[0].Groups["tag"].Value;             // "div"
string attr = matches[0].Groups["attributes"].Value;     // "class='myclass'"
string innerHtml = matches[0].Groups["innerHtml"].Value; // "<h2>hello world</h2>"
string closeTag = matches[0].Groups["closeTag"].Value;   // "div"
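
Whether a pattern like this finds the outermost or the innermost element depends on what the inner group is allowed to consume. The following is a minimal sketch, not a general HTML parser: `TagScope`, `Outermost`, and `Innermost` are hypothetical names, and the `[^<]*` variant is an assumption about what "innermost" should mean (an element with no child elements).

```csharp
using System;
using System.Text.RegularExpressions;

static class TagScope
{
    // Greedy .* plus a backreference: the leftmost opening tag is paired with
    // the last matching closing tag, which yields the outermost element.
    public static Match Outermost(string html) => Regex.Match(html,
        @"<(?<tag>[^\s>]+)[^>]*>(?<innerHtml>.*)</\k<tag>>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical variant: [^<]* forbids nested markup in the content, so
    // only an element without child elements can match.
    public static Match Innermost(string html) => Regex.Match(html,
        @"<(?<tag>[^\s>]+)[^>]*>(?<innerHtml>[^<]*)</\k<tag>>",
        RegexOptions.IgnoreCase);
}

class Program
{
    static void Main()
    {
        string html = "<div class='myclass'><h2>hello world</h2></div>";
        Match outer = TagScope.Outermost(html);
        Match inner = TagScope.Innermost(html);

        Console.WriteLine(outer.Groups["tag"].Value);       // div
        Console.WriteLine(outer.Groups["innerHtml"].Value); // <h2>hello world</h2>
        Console.WriteLine(inner.Groups["tag"].Value);       // h2
        Console.WriteLine(inner.Groups["innerHtml"].Value); // hello world
    }
}
```

Neither variant copes with repeated sibling elements of the same name or with malformed markup; for anything beyond simple, well-formed fragments an HTML parser is the safer choice.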
p.s.w.g
  • Thanks, but that is a bit too simple. If I test it with, for instance, the string `"<div class='myclass'><h2>hello world</h2></div>"`, it breaks :/
    – Eric Herlitz Mar 14 '14 at 19:38
  • @EricHerlitz What should the result be in that case? – p.s.w.g Mar 14 '14 at 19:41
  • Running your regex the result would be `tag = "h2"` `innerHtml = "hello world"` `closeTag = "/div"` – Eric Herlitz Mar 14 '14 at 19:44
  • @EricHerlitz I understand, but what would you *like* the result to be? `tag: "h2", innerHtml: "hello world", closeTag: "h2"` or `tag: "div", innerHtml: "<h2>hello world</h2>", closeTag: "div"`? In other words, are you trying to find the outermost tag or the innermost tag? Please update your question to be more specific.
    – p.s.w.g Mar 14 '14 at 19:46
  • The second is right, the outermost tag. Your regex minus the `s/` seems to work fine: `@"<(?<tag>[^\>]+)>(?<innerHtml>.*)<(?<closeTag>[^\s>]+)>",` – Eric Herlitz Mar 14 '14 at 19:50

You want the Singleline option, not Multiline. Singleline enables . to match linefeeds, while Multiline changes the behavior of the anchors (^ and $), which you aren't using.
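
A minimal sketch of the difference, assuming a hypothetical input whose inner text spans two lines (`DotDemo` and `InnerHtml` are illustrative names, not part of the question):

```csharp
using System;
using System.Text.RegularExpressions;

static class DotDemo
{
    // Returns the captured inner text, or null when the pattern does not match.
    public static string InnerHtml(string input, RegexOptions options)
    {
        Match m = Regex.Match(input, @"<h2>(?<innerHtml>.*)</h2>", options);
        return m.Success ? m.Groups["innerHtml"].Value : null;
    }
}

class Program
{
    static void Main()
    {
        string input = "<h2>hello\nworld</h2>";

        // Multiline only affects ^ and $; '.' still stops at '\n', so no match.
        Console.WriteLine(DotDemo.InnerHtml(input, RegexOptions.Multiline) == null);            // True

        // Singleline lets '.' match '\n', so the whole element is captured.
        Console.WriteLine(DotDemo.InnerHtml(input, RegexOptions.Singleline) == "hello\nworld"); // True
    }
}
```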

Also, if you want the closing tag to have the same name as the opening tag, you should use a backreference. Here I've used '' as the name delimiters instead of <> to reduce confusion:

var matches = Regex.Matches("<h2>hello world</h2>",
    @"<(?'tag'[^\s/>]+)[^>]*>(?'innerHtml'.*)</\k'tag'>",
    RegexOptions.IgnoreCase | RegexOptions.Singleline);

And you don't need the Compiled option. All it does is make it more expensive to create the Regex object, for an increase in performance that you almost certainly don't need and won't notice.
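
If the same pattern is applied to many strings, one option is to build the Regex once and reuse the instance, still without Compiled. This is only a sketch under that assumption: `TagParser` and `TryParse` are hypothetical names, and the pattern is a variant that also tolerates attributes in the opening tag.

```csharp
using System;
using System.Text.RegularExpressions;

static class TagParser
{
    // Constructed once and reused for every call; RegexOptions.Compiled is
    // deliberately left out.
    private static readonly Regex Element = new Regex(
        @"<(?'tag'[^\s/>]+)[^>]*>(?'innerHtml'.*)</\k'tag'>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns false (with empty outputs) when no element is found.
    public static bool TryParse(string html, out string tag, out string innerHtml)
    {
        Match m = Element.Match(html);
        tag = m.Success ? m.Groups["tag"].Value : "";
        innerHtml = m.Success ? m.Groups["innerHtml"].Value : "";
        return m.Success;
    }
}

class Program
{
    static void Main()
    {
        if (TagParser.TryParse("<div class='myclass'><h2>hello world</h2></div>",
                               out string tag, out string innerHtml))
        {
            Console.WriteLine(tag);       // div
            Console.WriteLine(innerHtml); // <h2>hello world</h2>
        }
    }
}
```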

Alan Moore