I am trying to match the <article> tag and all its content in an HTML string using Regex.

I can successfully match the opening tag with its attributes: <article[^>]*>

I'm having trouble matching the contents. The (.*?) technique is not working for me.

Please help.

Andrei
  • Please share complete information. What issues are you getting? Why can't you use an HTML parser API? – Braj Aug 17 '14 at 12:11
  • You mention C#, so why not use LINQ to XML? We definitely need more details on the XML structure to answer. – codebased Aug 17 '14 at 12:12
  • It may be worth using something like HTML agility pack as this is going to be tricky to do properly with regex alone http://htmlagilitypack.codeplex.com/ – geedubb Aug 17 '14 at 12:13
  • Obligatory reference: http://stackoverflow.com/a/1732454/1149773 – Douglas Aug 17 '14 at 12:13
  • I am planning to use LINQ to XML once I have all the necessary tags. The page I am trying to parse has a DOM structure that XElement.Parse fails to parse. – Andrei Aug 17 '14 at 12:14

1 Answer

You cannot use regular expressions to parse HTML in general. However, in constrained scenarios (i.e. when the input follows a rigid structure), you might be able to get away with it. In your case, you can use the following regex, provided that:

  • The <article> tags are not self-closing
  • The <article> elements do not contain other <article> descendants
  • The strings <article and </article> do not appear as literal text elsewhere in your HTML (for example inside comments, scripts, or attribute values).

Code:

var matches = Regex.Matches(html, @"<article.*?</article>", RegexOptions.Singleline);
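
If you also want the inner content on its own, here is a minimal sketch under the same constraints listed above. RegexOptions.Singleline is what lets the dot match line breaks, which is a likely reason a bare (.*?) stops working once the article spans several lines. The class name ArticleExtractor, the method ExtractBodies, and the group name "body" are hypothetical names chosen for illustration, and the \b word boundary and IgnoreCase option are my own additions, not part of the pattern above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ArticleExtractor
{
    // Singleline makes '.' match newlines, so the lazy group can span lines.
    // \b keeps the pattern from matching other tag names that start with "article".
    // IgnoreCase is an assumption about tag casing; drop it if tags are always lowercase.
    private static readonly Regex ArticlePattern = new Regex(
        @"<article\b[^>]*>(?<body>.*?)</article>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the text between each opening and closing article tag.
    // Still assumes no nested or self-closing <article> elements.
    public static List<string> ExtractBodies(string html)
    {
        var bodies = new List<string>();
        foreach (Match match in ArticlePattern.Matches(html))
        {
            bodies.Add(match.Groups["body"].Value);
        }
        return bodies;
    }
}
```

Calling ArticleExtractor.ExtractBodies(html) gives one string per article. If any of the constraints do not hold for your pages, a real HTML parser is the safer route.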
Douglas