Possible Duplicate:
Regular expression for parsing links from a webpage?

How can I find all URLs in HTML using a regular expression? I only need URLs for pages, so I want to exclude URLs that end with ".css", ".jpg", ".js", etc.

Example of HTML:

<a href=index.php?option=content&amp;task=view&amp;id=2&amp;Itemid=25 class="menu_selected" id="">Home</a>

or

<a href="http://data.stackexchange.com">data</a> |
                <a href="http://shop.stackexchange.com/">shop</a> |
                <a href="http://stackexchange.com/legal">legal</a> |

Thanks

Liza24
  • string strRef = @"(href|HREF)[ ]*=[ ]*[""'][^""'#>]+[""']"; MatchCollection matches = new Regex(strRef).Matches(strResponse); – Liza24 Jun 21 '12 at 14:48

1 Answer

If you can, avoid regular expressions and use a proper HTML parser instead. For example, reference the HTML Agility Pack and use the following:

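// Requires the HtmlAgilityPack package; System.Linq provides Enumerable.Empty.
using System;
using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(yourHtmlInput);
// SelectNodes returns null when nothing matches, hence the fallback to an empty sequence.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]")
                              ?? Enumerable.Empty<HtmlNode>())
{
    string href = link.Attributes["href"].Value;
    if (!String.IsNullOrEmpty(href))
    {
        // Act on the link here, including ignoring it if it's a .jpg etc.
    }
}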
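
The "ignore it if it's a .jpg etc." step could look roughly like the sketch below. The `LinkFilter` class, the `IsPageUrl` method, and the extension list are hypothetical names chosen for illustration and are not part of HTML Agility Pack; adjust the list to whatever you consider a non-page resource.

```csharp
using System;
using System.Linq;

public static class LinkFilter
{
    // Assumed list of extensions that mark non-page resources; extend as needed.
    private static readonly string[] SkippedExtensions =
        { ".css", ".js", ".jpg", ".jpeg", ".png", ".gif", ".ico" };

    // Returns true when href does not end in one of the skipped extensions.
    public static bool IsPageUrl(string href)
    {
        if (string.IsNullOrWhiteSpace(href)) return false;

        // Drop the query string and fragment so "style.css?v=2" is still recognised.
        int cut = href.IndexOfAny(new[] { '?', '#' });
        string path = cut >= 0 ? href.Substring(0, cut) : href;

        // Case-insensitive comparison so "PHOTO.JPG" is skipped as well.
        return !SkippedExtensions.Any(ext =>
            path.EndsWith(ext, StringComparison.OrdinalIgnoreCase));
    }
}
```

Inside the loop above, the body would then run only when `LinkFilter.IsPageUrl(href)` returns true. The check looks only at the text of the path, so a page served from a URL that happens to end in a skipped extension would be dropped too.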
Rawling
  • I think a regular expression will be faster than HTML Agility Pack; please correct me if I am wrong – Liza24 Jun 21 '12 at 14:52
  • It will probably be faster; HTML Agility Pack is likely to be more robust. I really only posted this because I had the code to hand from a project I did recently :) – Rawling Jun 21 '12 at 14:53
  • RegEx won't work at all: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html . Simple rule of thumb: use Regular Expressions to parse "Regular Languages" - HTML is not a "Regular Language" (refer to the Chomsky Hierarchy for more information). – Dai Jun 21 '12 at 14:56
  • Regex might be faster, but parsing HTML with regex is problematic. How many pages do you need to parse? – paparazzo Jun 21 '12 at 14:57
  • I need to parse a lot of pages (up to 5000). The application is multithreaded, so I want each thread to finish its work quickly – Liza24 Jun 21 '12 at 15:31
  • 5 thousand is not that many. Maybe give HTML Agility Pack a try. – paparazzo Jun 21 '12 at 15:52
  • @Blam At least it's not over 9000. – Rawling Jun 21 '12 at 15:53