I am trying to scrape a webpage for links to articles.

This is my code:

static void Main(string[] args)
{
    WebClient web = new WebClient();
    string html = web.DownloadString("http://www.dailymirror.lk");
    MatchCollection m1 = Regex.Matches(html, @"<a href=""(.+?)""/s*class=""panel-heading"">",RegexOptions.Singleline);

    foreach(Match m in m1)
    {
        Console.WriteLine(m.Groups[1].Value);
    }
}

The HTML markup I am targeting on the page is this:

<a href="http://www.dailymirror.lk/99833/ravi-s-budget-blues" class="panel-heading">

However, my code does not retrieve the link. Is there any way I could fix it?

svick
Adhil
  • It's generally a bad idea to scrape HTML with Regex. Use a proper library such as [HtmlAgilityPack](https://htmlagilitypack.codeplex.com/) – Equalsk Dec 17 '15 at 15:37
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags :) – Ian P Dec 17 '15 at 15:37
  • @IanP -- somebody was going to paste that link. I think it is mandatory. ;) – David Tansey Dec 17 '15 at 17:21

1 Answer

As mentioned in the comments above, parsing HTML with a regular expression is generally a bad idea.
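That said, the pattern in the question matches nothing for a simple reason: `/s*` is a literal forward slash followed by zero or more `s` characters, not the whitespace class `\s*`. The sketch below shows the corrected escape against the single tag quoted in the question. The names `RegexLinkDemo` and `ExtractLinks` are made up for illustration, and the pattern still assumes that `href` comes directly before `class` and that both use double quotes, which is why it stays fragile on real pages.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RegexLinkDemo
{
    // Hypothetical helper: returns the href of every <a> tag whose href is
    // immediately followed by class="panel-heading".
    public static List<string> ExtractLinks(string html)
    {
        // \s* (backslash) matches optional whitespace; the original /s*
        // demanded a literal '/' after the closing quote of href.
        // [^"]+ keeps the capture inside one attribute value, whereas
        // .+? with RegexOptions.Singleline can run across several tags.
        const string pattern = @"<a\s+href=""([^""]+)""\s*class=""panel-heading""\s*>";

        var links = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern))
        {
            links.Add(m.Groups[1].Value);
        }
        return links;
    }

    static void Main()
    {
        // Sample markup copied from the question.
        string html = "<a href=\"http://www.dailymirror.lk/99833/ravi-s-budget-blues\" class=\"panel-heading\">";

        foreach (string link in ExtractLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

Any change in attribute order or quoting on the live page would break this again, which is the practical argument for an HTML parser.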

One approach is to use the HTML Agility Pack:

https://htmlagilitypack.codeplex.com/

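// Requires the HtmlAgilityPack package and: using HtmlAgilityPack;
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://www.mywebsite.com");
// SelectNodes returns null when nothing matches, so check before iterating.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        // do something with link here, e.g. print its href attribute
        Console.WriteLine(link.GetAttributeValue("href", ""));
    }
}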
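To pick out only the links the question is after, the XPath can be narrowed to anchors carrying the `panel-heading` class. This is a minimal sketch, assuming the HtmlAgilityPack NuGet package is installed and that the `class` attribute is exactly `panel-heading`, as in the quoted markup. `PanelHeadingLinks` and `GetLinks` are illustrative names, and the HTML is loaded from an inline string so the example does not depend on the live site.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class PanelHeadingLinks
{
    // Hypothetical helper: collects href values of <a class="panel-heading"> elements.
    public static List<string> GetLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        // Exact match on the class attribute; an element with several
        // classes would need a contains()-style test instead.
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//a[@href][@class='panel-heading']");
        if (nodes == null)
        {
            return result; // SelectNodes yields null when nothing matches
        }

        foreach (HtmlNode node in nodes)
        {
            result.Add(node.GetAttributeValue("href", ""));
        }
        return result;
    }

    static void Main()
    {
        string html =
            "<a href=\"http://www.dailymirror.lk/99833/ravi-s-budget-blues\" class=\"panel-heading\">Budget</a>" +
            "<a href=\"/other\">Other</a>";

        foreach (string link in GetLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

For the real page, `HtmlWeb.Load` with the site URL would replace `LoadHtml`; the XPath stays the same.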
Ian P