
I need to extract the content inside the divtestimonial1 div. I am using the following regex, but it only returns the first line:

Regex r = new Regex("<div([^<]*<(?!/div>))");
  <div class="testimonial_content" id="divtestimonial1">
          <a name="T1"></a>
          <div class="testimonial_headline">%testimonial1headline</div>
          <p align="left"><img src="" alt="" width="193" height="204" align="left" hspace="10" id="img_T1"/><span class="testimonial_text">%testimonial1text</span><br />
          </p>
  </div>
BoltClock
Sandhurst

2 Answers


Regular expressions are generally not a good choice for parsing HTML. You would be better off with a dedicated parser such as HTML Agility Pack.

That being said, you can match your particular sample input using this regex, provided you pass RegexOptions.Singleline so that . also matches the line breaks:

<div.*?id="divtestimonial1".*?>.*</div>

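But it might break in your real-world scenario. One of the main difficulties with regex and HTML is handling nested tags correctly: the greedy .* runs to the last </div> in the input, so any div that follows the target one is captured too.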
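To make the nesting problem concrete, here is a minimal sketch using only System.Text.RegularExpressions. The HTML fragment, the footer div, and the NestingDemo class are made up for illustration; they are not from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestingDemo
{
    // Hypothetical fragment: the target div contains a nested div
    // and is followed by an unrelated sibling div.
    public const string Html =
        "<div id=\"divtestimonial1\">\n" +
        "  <div class=\"testimonial_headline\">Headline</div>\n" +
        "  <p>Text</p>\n" +
        "</div>\n" +
        "<div id=\"footer\">Footer</div>";

    // Greedy capture: .* backtracks from the end of the input to the LAST </div>.
    public static string Greedy(string html)
    {
        // Singleline makes '.' match newlines as well.
        return Regex.Match(html,
            "<div[^>]*id=\"divtestimonial1\"[^>]*>(.*)</div>",
            RegexOptions.Singleline).Groups[1].Value;
    }

    // Lazy capture: .*? stops at the FIRST </div> it can reach.
    public static string Lazy(string html)
    {
        return Regex.Match(html,
            "<div[^>]*id=\"divtestimonial1\"[^>]*>(.*?)</div>",
            RegexOptions.Singleline).Groups[1].Value;
    }

    public static void Main()
    {
        // The greedy capture swallows the opening tag of the footer div.
        Console.WriteLine(Greedy(Html).Contains("footer"));
        // The lazy capture ends at the nested headline's </div>, so the paragraph is lost.
        Console.WriteLine(Lazy(Html).Contains("<p>Text</p>"));
    }
}
```

Neither quantifier can tell which </div> belongs to the target div, which is why a real parser is the safer route.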

carla
driis
  • And while that is generally true, what the OP asks here is quite practical and possible with RegEx. – H H Jan 23 '11 at 17:55

Wouldn't HtmlAgilityPack be a good option here?

// Requires the HtmlAgilityPack NuGet package.
using System;
using HtmlAgilityPack;

string input = "<div class=\"testimonial_content\" id=\"divtestimonial1\"><a name=\"T1\"></a><div class=\"testimonial_headline\">% testimonial1headline</div><p align=\"left\"><img src=\"\" alt=\"\" width=\"193\" height=\"204\" align=\"10\" id=/><br /></p></div>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(input);
// Select the div by its id with an XPath query.
HtmlNode divNode = doc.DocumentNode.SelectSingleNode("//div[@id='divtestimonial1']");
if (divNode != null)
{
    // InnerHtml is everything between the div's opening and closing tags.
    string content = divNode.InnerHtml;
    Console.WriteLine(content);
}

Result:

<a name="T1"></a><div class="testimonial_headline">% testimonial1headline</div><p align="left"><img src="" alt="" width="193" height="204" align="10" id=/><br></p>

Using Regex.Match it would look like this:

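// Needs: using System; using System.Text.RegularExpressions;
string input = "<div class=\"testimonial_content\" id=\"divtestimonial1\"><a name=\"T1\"></a><div class=\"testimonial_headline\">% testimonial1headline</div><p align=\"left\"><img src=\"\" alt=\"\" width=\"193\" height=\"204\" align=\"10\" id=/><br /></p></div>";
// The greedy .* extends to the last </div>, so the nested headline div stays inside the capture.
// A lazy .*? would stop at the first </div>, which closes the headline div.
// This only works while the target div is the last div in the input.
Match match = Regex.Match(input, "<div class=\"testimonial_content\" id=\"divtestimonial1\">(?<content>.*)</div>", RegexOptions.Singleline);
if (match.Success)
{
    string content = match.Groups["content"].Value;
    Console.WriteLine(content);
}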
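If the target div is not the last one on the page, a plain greedy or lazy pattern cannot find the right closing tag. One possible way to stay with regex is .NET's balancing groups, which keep a counter of open div tags. The following is only a sketch: the DivExtractor class, the InnerHtml helper, and the sample HTML are hypothetical, and it assumes well-formed, lowercase div tags with a double-quoted id attribute.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DivExtractor
{
    // Returns the inner HTML of <div ... id="id" ...>, or null when nothing matches.
    public static string InnerHtml(string html, string id)
    {
        string pattern =
            @"<div\b[^>]*\sid=""" + Regex.Escape(id) + @"""[^>]*>" +
            @"(?<content>" +
            @"  (?>" +
            @"      <div\b[^>]*> (?<depth>)" +   // nested opening tag: push onto the depth stack
            @"    | </div>       (?<-depth>)" +  // closing tag: pop; fails when the stack is empty
            @"    | (?!</?div\b) ." +            // any character that does not start a div tag
            @"  )*" +
            @")" +
            @"(?(depth)(?!))" +                  // reject the match if a nested div is still open
            @"</div>";                           // the closing tag of the target div itself

        Match m = Regex.Match(html, pattern,
            RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
        return m.Success ? m.Groups["content"].Value : null;
    }

    public static void Main()
    {
        // Hypothetical input: nested div inside the target, sibling div after it.
        string html =
            "<div class=\"testimonial_content\" id=\"divtestimonial1\">\n" +
            "  <div class=\"testimonial_headline\">Headline</div>\n" +
            "  <p>Text</p>\n" +
            "</div>\n" +
            "<div id=\"footer\">Footer</div>";

        Console.WriteLine(InnerHtml(html, "divtestimonial1"));
    }
}
```

Even this version is fooled by comments, scripts, or attribute values that contain the text </div>, so an HTML parser remains the more robust choice.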