Regular expression to isolate text from some sample html?

Question

I'm looking for a C# regular expression that extracts the following:

<a id=sector href="?catid=us-58211593" >Financial</a>

... from this html string:

<div class="g-unit g-first">Sector: <a id=sector href="?catid=us-58211593" >Financial</a> &gt; Industry: <a href="?catid=us-64965887" >Misc. Financial Services</a>

The attribute href="?catid=us-58211593" is not relevant, so the match should key on the "a" tag and the "id=sector" attribute.

Update

Indeed - RegEx is just not the right tool for the job. It only took 3 lines of code from the HTML Agility Pack to achieve the required result:

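using HtmlAgilityPack; // from the Html Agility Pack library

// Download the page and parse it into a document.
HtmlWeb hw = new HtmlWeb();
HtmlDocument myDoc = hw.Load("http://www.google.com/finance?q=IBM");
// Look up the element with id "sector" and read its text content.
var etc = myDoc.GetElementbyId("sector").InnerText;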

You probably want an HTML parser as there are any number of references as to why Regex isn't sufficient to parse HTML. Check out [Html Agility Pack](http://htmlagilitypack.codeplex.com/). — lsuarez, Jun 15 '11 at 20:17

score 3 · Accepted Answer · edited May 23 '17 at 12:12


Don't use Regex to parse HTML. There are better solutions, such as HTML Agility Pack.
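
As a rough illustration of that suggestion, here is a minimal sketch that parses the HTML string from the question directly instead of fetching a page. It assumes the HtmlAgilityPack NuGet package is referenced and that the code runs as C# top-level statements; the variable names (html, doc, sector) are only illustrative and are not from the original posts.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

// The HTML fragment from the question, parsed from a string rather than fetched.
string html =
    "<div class=\"g-unit g-first\">Sector: " +
    "<a id=sector href=\"?catid=us-58211593\" >Financial</a> &gt; Industry: " +
    "<a href=\"?catid=us-64965887\" >Misc. Financial Services</a>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// GetElementbyId returns null when no element has the given id.
HtmlNode sector = doc.GetElementbyId("sector");
if (sector != null)
{
    Console.WriteLine(sector.OuterHtml);  // the whole <a id=sector ...> element
    Console.WriteLine(sector.InnerText);  // the link text, "Financial" for this input
    Console.WriteLine(sector.GetAttributeValue("href", "")); // href value, or "" if absent
}
```

OuterHtml is the closest match to what the question asks for (the full anchor element), while InnerText gives only the visible text. Neither depends on the href value or on the order of the attributes.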

edited May 23 '17 at 12:12 by Community

answered Jun 15 '11 at 20:16 by driis


ohhhhhhh, you are so totally correct :-) I am surprised that there are still developers that consider regex a tool for parsing HTML. – Darin Dimitrov Jun 15 '11 at 20:18

Every time I see "regex" and "HTML" in a question title together, I cry a little on the inside. – Justin Morgan - On strike Jun 15 '11 at 21:02
