1

I have this string:

<p/><ul><li>test1<p/></li><li>test2<p/></li></ul><p/>

What I'm trying to do is extract all the "p" tags inside the "li" tags, but not the "p" tags outside them.

So far I'm only able to extract all the "li" tags with:

\<li\>(.*?)\</li\>

I'm lost on how to extract the "p" tags within them.

Any pointers are greatly appreciated!

Liming

3 Answers

5

It is a lot more reliable to use an HTML parser instead of a regex. Use HTML Agility Pack:

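using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("<p/><ul><li>test1<p/></li><li>test2<p/></li></ul><p/>");
// Select the <p> nodes that are descendants of an <li>; <p> nodes outside any <li> are not selected.
IEnumerable<HtmlNode> result = doc.DocumentNode
                                  .Descendants("li")
                                  .SelectMany(x => x.Descendants("p"));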
Mark Byers
  • Thanks, Mark. Actually, I'm parsing BBCode out of a bunch of text, and after the last pass of converting the BBCode the text came out like that, so I need to do a bit of cleanup. Thanks for the suggestion, though. – Liming Mar 05 '10 at 22:51
2
<li>(.*?<p/?>.*?)</li>

This matches the content of each <li>...</li> pair that also contains a <p/>. If you only want to match the <p/> itself, use:

(?<=<li>).*?(<p/?>).*?(?=</li>)

Here group 1 captures the <p/> tag.
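
For reference, here is a minimal C# sketch of how the pattern above could be run against the string from the question. The class and method names (PTagExtractor, FirstPerListItem, AllInsideListItems) are hypothetical, chosen only for illustration. The second method is not part of this answer: it is an assumed two-pass variant that first isolates each <li>...</li> and then collects every <p> or <p/> inside it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PTagExtractor
{
    // Pattern from this answer: group 1 captures the first <p> or <p/>
    // found after each <li>.
    public static List<string> FirstPerListItem(string html)
    {
        var tags = new List<string>();
        foreach (Match m in Regex.Matches(html, @"(?<=<li>).*?(<p/?>).*?(?=</li>)"))
        {
            tags.Add(m.Groups[1].Value);
        }
        return tags;
    }

    // Assumed two-pass variant (not from the answer): match each <li>...</li>
    // first, then look for <p> or <p/> only inside the captured content.
    public static List<string> AllInsideListItems(string html)
    {
        var tags = new List<string>();
        foreach (Match li in Regex.Matches(html, @"<li>(.*?)</li>", RegexOptions.Singleline))
        {
            foreach (Match p in Regex.Matches(li.Groups[1].Value, @"<p/?>"))
            {
                tags.Add(p.Value);
            }
        }
        return tags;
    }

    public static void Main()
    {
        string html = "<p/><ul><li>test1<p/></li><li>test2<p/></li></ul><p/>";
        Console.WriteLine(FirstPerListItem(html).Count);   // prints 2
        Console.WriteLine(AllInsideListItems(html).Count); // prints 2
    }
}
```

On the question's input both methods return the two inner <p/> tags and skip the outer ones. They can differ on other input: the single pattern captures at most one tag per <li>, and because .*? can run past a </li>, it may pick up a <p/> that sits between two list items. The two-pass version avoids both cases, at the cost of a second regex.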

Pindatjuh
2

Try this; it uses lookbehind and lookahead so that the <li> tags are not part of the match.

(?<=<li>)(.*?<p/?>.*?)(?=</li>)

P.S. You also need to fix your HTML, because the way the <p> tags are written is not right. The regex works on the HTML below:

<ul><li><p>test1<p/></li><li><p>test2<p/></li></ul>
James