I've been trying for hours to solve this problem. I want to use a regular expression to select whole divs, including any divs nested inside them. See the example string below:

AA <div> Text1 </div> BB <div style=\"vertical-align : middle;\"> Text2 <div>Text 3</div> </div> CC

I want to get back the following values:

<div> Text1 </div>
<div style=\"vertical-align : middle;\"> Text2 <div>Text 3</div> </div>

The closest I've got is the following pattern, but it just matches each div tag on its own:

(?<BeginTag><\s*div.*?>)|(?<EndTag><\s*/\s*div.*?>)

Any help would be great.

Chris
    Well, http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – driis Feb 16 '13 at 15:55
    `I want to use regular expressions to select whole divs including nested divs` - no, believe me, you don't want to use regular expressions for this task. Otherwise the hours you have already wasted trying to make this work will quickly turn into weeks, months and years, with the same result. A wise man once said: `Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems.` So have you tried using an HTML parser such as [`HTML Agility Pack`](http://htmlagilitypack.codeplex.com/)? – Darin Dimitrov Feb 16 '13 at 15:59
  • @driis put this into an answer. Unfortunately, the answer is: it is not possible. – usr Feb 16 '13 at 16:08

1 Answer

To expand on my rather snarky comment: a regex is not a good tool for parsing any kind of HTML. It is feasible only in the simplest of scenarios, and even then I would not recommend it.
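To make the limitation concrete: a pattern like the one in the question can find the individual tags, but it cannot pair an opening tag with its matching closing tag, so the nesting has to be tracked outside the regex. The following is only a minimal sketch of that idea, not a recommendation. The class and method names (`DivScanner`, `TopLevelDivs`) are hypothetical, and it assumes well-formed input where `<div` never appears inside comments, scripts, or attribute values and no div is self-closing.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class DivScanner
{
    // Matches an opening <div ...> or a closing </div>.
    // The "close" group is set only for closing tags.
    static readonly Regex Tag =
        new Regex(@"<\s*(?<close>/)?\s*div\b[^>]*>", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns each outermost div together with everything nested in it.
    public static List<string> TopLevelDivs(string html)
    {
        var result = new List<string>();
        int depth = 0;   // how many divs are currently open
        int start = -1;  // index where the current outermost div began

        foreach (Match m in Tag.Matches(html))
        {
            if (!m.Groups["close"].Success)
            {
                if (depth == 0) start = m.Index; // a new outermost div begins here
                depth++;
            }
            else if (depth > 0 && --depth == 0)
            {
                // The outermost div just closed: take everything from its opening tag.
                result.Add(html.Substring(start, m.Index + m.Length - start));
            }
        }
        return result;
    }

    static void Main()
    {
        string markup = "AA <div> Text1 </div> BB <div style=\"vertical-align : middle;\"> Text2 <div>Text 3</div> </div> CC";
        foreach (string div in TopLevelDivs(markup))
            Console.WriteLine(div);
    }
}
```

For the string from the question this prints the two requested values, but every assumption listed above is something real-world HTML breaks routinely, which is why a parser is the safer choice.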

What you need is a proper HTML parser. In the .NET world, a nice library for this is the Html Agility Pack, or perhaps the SgmlReader project.

You do need to invest a little bit of time in learning the API, but it will be worth it.

For the little fragment you are showing, I think the easiest API for you will be SGMLReader. It can read HTML as if it were XML, which means you can convert it to an XDocument and use a much nicer API. The code for that could look like this:

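using System;
using System.IO;
using System.Xml.Linq;

// The fragment is wrapped in a root element so it loads as a single document.
string markup = "<html>AA <div> Text1 </div> BB <div style=\"vertical-align : middle;\"> Text2 <div>Text 3</div> </div> CC</html>";
XDocument doc;
// SgmlReader presents the HTML as an XmlReader, which XDocument can load directly.
using (var reader = Sgml.SgmlReader.Create(new StringReader(markup)))
    doc = XDocument.Load(reader);

// Elements() returns only direct children, so nested divs stay inside their parent.
var rootLevelDivs = doc.Root.Elements("div");
foreach (var div in rootLevelDivs)
    Console.WriteLine(div);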
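
The same extraction with the Html Agility Pack, the other library mentioned above, might look roughly like this. It is a sketch that assumes the `HtmlAgilityPack` NuGet package is referenced; the class and method names (`HapExample`, `TopLevelDivs`) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

static class HapExample
{
    // Hypothetical helper: returns the markup of each div directly under the document node.
    public static List<string> TopLevelDivs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        // Elements("div") yields only direct children, so a nested div
        // is included as part of its parent's OuterHtml, not on its own.
        foreach (HtmlNode div in doc.DocumentNode.Elements("div"))
            result.Add(div.OuterHtml);
        return result;
    }

    static void Main()
    {
        string markup = "AA <div> Text1 </div> BB <div style=\"vertical-align : middle;\"> Text2 <div>Text 3</div> </div> CC";
        foreach (string div in TopLevelDivs(markup))
            Console.WriteLine(div);
    }
}
```

Unlike the SgmlReader version, this does not need the fragment to be wrapped in a root element, because the library's document node acts as the parent of the top-level nodes.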
driis
    Most importantly, a regex cannot possibly parse a recursive structure with unbounded nesting depth. – usr Feb 16 '13 at 16:38