
I'm looking for a C# regular expression that extracts the following:

<a id=sector href="?catid=us-58211593" >Financial</a>

... from this html string:

<div class="g-unit g-first">Sector: <a id=sector href="?catid=us-58211593" >Financial</a> &gt; Industry: <a href="?catid=us-64965887" >Misc. Financial Services</a> 

The href="?catid=us-58211593" attribute is not relevant, so the pattern should match on the "a" tag and the "id=sector" attribute.
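For this one fixed snippet, a narrow pattern along the following lines would match the anchor by tag name and id only. It is a sketch, not a general solution: it assumes the id attribute comes first and is unquoted, exactly as in the sample, and it will break on markup that is laid out differently. The names `SectorRegex` and `Find` are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class SectorRegex
{
    // Hypothetical helper: finds the first <a> element whose first attribute
    // is an unquoted id=sector, ignoring any attributes that follow it.
    public static Match Find(string html)
    {
        // <a\s+id=sector\b  the tag name followed by the id attribute
        // [^>]*>            any remaining attributes up to the end of the tag
        // (.*?)</a>         lazily capture the link text up to the first </a>
        return Regex.Match(html, @"<a\s+id=sector\b[^>]*>(.*?)</a>", RegexOptions.IgnoreCase);
    }
}

class Program
{
    static void Main()
    {
        string html = "<div class=\"g-unit g-first\">Sector: <a id=sector href=\"?catid=us-58211593\" >Financial</a> &gt; Industry: <a href=\"?catid=us-64965887\" >Misc. Financial Services</a>";

        Match m = SectorRegex.Find(html);
        if (m.Success)
        {
            Console.WriteLine(m.Value);           // the whole anchor element
            Console.WriteLine(m.Groups[1].Value); // the link text only
        }
    }
}
```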

Update

Indeed, regex is just not the right tool for the job. It took only three lines of code with the HTML Agility Pack to get the required result:

using HtmlAgilityPack;

// Download the page, then read the text of the element with id="sector".
HtmlWeb hw = new HtmlWeb();
HtmlDocument myDoc = hw.Load("http://www.google.com/finance?q=IBM");
var etc = myDoc.GetElementbyId("sector").InnerText;
Contango
You probably want an HTML parser; there are any number of references explaining why regex isn't sufficient to parse HTML. Check out [Html Agility Pack](http://htmlagilitypack.codeplex.com/). – lsuarez Jun 15 '11 at 20:17

1 Answer


Don't use Regex to parse HTML. There are better solutions, such as HTML Agility Pack.
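As a rough sketch of what that looks like for the HTML string in the question (rather than a downloaded page), the fragment can be loaded directly and the element looked up by id. This assumes the HtmlAgilityPack NuGet package is referenced; whether `OuterHtml` reproduces the original markup character for character is not guaranteed here.

```csharp
using System;
using HtmlAgilityPack; // third-party package, assumed to be installed

class Program
{
    static void Main()
    {
        string html = "<div class=\"g-unit g-first\">Sector: <a id=sector href=\"?catid=us-58211593\" >Financial</a> &gt; Industry: <a href=\"?catid=us-64965887\" >Misc. Financial Services</a>";

        // Parse the fragment from a string instead of fetching a URL.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Look the anchor up by its id; the href value plays no part in the lookup.
        HtmlNode sector = doc.GetElementbyId("sector");
        if (sector != null)
        {
            Console.WriteLine(sector.OuterHtml); // the anchor element as markup
            Console.WriteLine(sector.InnerText); // the link text only
        }
    }
}
```

The null check matters because `GetElementbyId` returns null when no element carries the requested id.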

Community
driis