
What I want is to open a link from a website (from its HTML content) and get the HTML of the newly opened site.

Example: I have www.google.com; now I want to find all links on it, and for each link I want the HTML content of the linked site.

I do something like this:

foreach (String link in GetLinksFromWebsite(htmlContent))
{
    using (var client = new WebClient())
    {
        htmlContent = client.DownloadString("http://" + link);
    }

    // istBildURL is a MatchCollection from an image-URL regex run elsewhere
    foreach (Match treffer in istBildURL)
    {
        string bildUrl = treffer.Groups[1].Value;
        bildLinks.Add(bildUrl);
    }
}

public static List<String> GetLinksFromWebsite(string htmlSource)
{
    string linkPattern = "<a href=\"(.*?)\">(.*?)</a>";
    MatchCollection linkMatches = Regex.Matches(htmlSource, linkPattern, RegexOptions.Singleline);
    List<string> linkContents = new List<string>();
    foreach (Match match in linkMatches)
    {
        linkContents.Add(match.Value);
    }
    return linkContents;
}

The other problem is that I only get plain links, not LinkButtons (ASP.NET). How can I solve that?

eMi

1 Answer


Steps to follow:

  1. Download Html Agility Pack
  2. Reference the assembly you have downloaded in your project
  3. Throw out everything in your project that uses a regex or regular expression to parse HTML (read this answer to better understand why). In your case that would be the contents of the GetLinksFromWebsite method.
  4. Replace what you have thrown away with a simple call to the Html Agility Pack parser.

Here's an example:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            var htmlSource = client.DownloadString("http://www.stackoverflow.com");
            foreach (var item in GetLinksFromWebsite(htmlSource))
            {
                // TODO: you could easily write a recursive function
                // that will call itself here and retrieve the respective contents
                // of the site ...
                Console.WriteLine(item);
            }
        }
    }

    public static List<String> GetLinksFromWebsite(string htmlSource)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlSource);
        return doc
            .DocumentNode
            .SelectNodes("//a[@href]")
            .Select(node => node.Attributes["href"].Value)
            .ToList();
    }
}
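
A minimal sketch of the recursive function mentioned in the TODO comment could look like this. The CrawlLinks name, the depth limit, the visited set, and the absolute-URL filter are illustrative assumptions, not part of the original answer:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

class Crawler
{
    // Downloads the page at 'url', prints the links it contains and then
    // recurses into each of them until 'depth' reaches zero.
    // The 'visited' set prevents infinite loops between pages that link to each other.
    static void CrawlLinks(string url, int depth, HashSet<string> visited)
    {
        if (depth <= 0 || !visited.Add(url))
            return;

        string htmlSource;
        using (var client = new WebClient())
        {
            htmlSource = client.DownloadString(url);
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(htmlSource);

        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes == null)
            return; // page contains no anchors

        foreach (var href in nodes.Select(n => n.Attributes["href"].Value))
        {
            // only follow absolute http(s) links in this sketch
            if (!href.StartsWith("http"))
                continue;

            Console.WriteLine(href);
            CrawlLinks(href, depth - 1, visited);
        }
    }

    static void Main()
    {
        CrawlLinks("http://www.stackoverflow.com", 2, new HashSet<string>());
    }
}

With a depth of 2 this prints the links on the start page and the links on every page it links to; the visited set is what keeps mutually linked pages from causing unbounded recursion.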
Darin Dimitrov
  • thx 4 the answer, I will test it and give Feedback and mark the answer if it works :D – eMi Nov 08 '11 at 07:55