Find tag with contents in HTML using Regex

Question

I am trying to find the article tag and all its content in an HTML string using Regex.

I can successfully match the opening tag with its attributes: <article[^>]*>

I'm having trouble matching the contents, though: (.*?) is not working for me.

Please help.

Please share complete information. what issues are you getting? Why can't you use HTML parser API? — Braj, Aug 17 '14 at 12:11
You are saying, c#, why not you are using Linq to Xml. Definitely we need more details on xml structure to answer. — codebased, Aug 17 '14 at 12:12
It may be worth using something like HTML agility pack as this is going to be tricky to do properly with regex alone http://htmlagilitypack.codeplex.com/ — geedubb, Aug 17 '14 at 12:13
Obligatory reference: http://stackoverflow.com/a/1732454/1149773 — Douglas, Aug 17 '14 at 12:13
I am planning to use Linq2Xml when I get all necessary tags. The DOM structure of a page I am trying to parse is not parsing using XElement.Parse. — Andrei, Aug 17 '14 at 12:14

score 1 · Accepted Answer · answered Aug 17 '14 at 12:18

You cannot use regular expressions to parse HTML in general. However, in constrained scenarios (i.e. when the input follows a rigid structure), you might be able to get away with it. In your case, you can use the following regex, provided that:

The <article> tags are not self-closing
The <article> elements do not contain other <article> descendants
The strings <article and </article> do not appear as literals in your HTML.

Code:

var matches = Regex.Matches(html, @"<article.*?</article>", RegexOptions.Singleline);
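
As a minimal sketch of how that one-liner might be wrapped and used, assuming the three conditions above hold: ArticleExtractor and Extract are hypothetical names introduced here for illustration and are not part of the original answer. RegexOptions.Singleline is what lets . match newlines, so an article whose content spans several lines is still captured.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; the name is an assumption.
public static class ArticleExtractor
{
    // Returns every <article ...>...</article> span found in the input.
    // Only reliable when articles are not nested and not self-closing.
    public static List<string> Extract(string html)
    {
        var results = new List<string>();

        // Lazy .*? stops at the first closing tag after each opening tag.
        // Singleline makes '.' match '\n' as well.
        var matches = Regex.Matches(html, @"<article.*?</article>", RegexOptions.Singleline);

        foreach (Match m in matches)
        {
            results.Add(m.Value);
        }
        return results;
    }
}
```

Calling ArticleExtractor.Extract on a string with two sibling articles should yield both, each including its opening and closing tags. If an article contains another article, the lazy match ends at the inner </article>, so the outer element comes back truncated, which is why the second condition matters.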
