
What I want is to open a link from a website (from its HTML content) and get the HTML of the newly opened site.

Example: I have www.google.com; now I want to find all links on it, and for each link I want the HTML content of the linked site.

I do something like this:

foreach (String link in GetLinksFromWebsite(htmlContent))
{
    using (var client = new WebClient())
    {
        htmlContent = client.DownloadString("http://" + link);
    }

    // istBildURL is a MatchCollection from an image-URL regex run elsewhere
    foreach (Match treffer in istBildURL)
    {
        string bildUrl = treffer.Groups[1].Value;
        bildLinks.Add(bildUrl);
    }
}

public static List<String> GetLinksFromWebsite(string htmlSource)
{
    string linkPattern = "<a href=\"(.*?)\">(.*?)</a>";
    MatchCollection linkMatches = Regex.Matches(htmlSource, linkPattern, RegexOptions.Singleline);
    List<string> linkContents = new List<string>();
    foreach (Match match in linkMatches)
    {
        linkContents.Add(match.Value);
    }
    return linkContents;
}

The other problem is that I only get plain links, not LinkButtons (ASP.NET). How can I solve that?

eMi

1 Answer


Steps to follow:

  1. Download Html Agility Pack
  2. Reference the assembly you have downloaded in your project
  3. Throw out everything in your project that uses a regex or regular expression to parse HTML (read this answer to better understand why). In your case that would be the contents of the GetLinksFromWebsite method.
  4. Replace what you have thrown away with a simple call to the Html Agility Pack parser.

Here's an example:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            var htmlSource = client.DownloadString("http://www.stackoverflow.com");
            foreach (var item in GetLinksFromWebsite(htmlSource))
            {
                // TODO: you could easily write a recursive function
                // that will call itself here and retrieve the respective contents
                // of the site ...
                Console.WriteLine(item);
            }
        }
    }

    public static List<String> GetLinksFromWebsite(string htmlSource)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlSource);
        return doc
            .DocumentNode
            .SelectNodes("//a[@href]")
            .Select(node => node.Attributes["href"].Value)
            .ToList();
    }
}
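
A minimal sketch of the recursive function mentioned in the TODO comment could look like this. The CrawlLinks name, the depth limit, the visited set, and the absolute-URL filter are illustrative assumptions, not part of the original answer:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

class Crawler
{
    // Downloads the page at 'url', prints the links it contains and then
    // recurses into each of them until 'depth' reaches zero.
    // The 'visited' set prevents infinite loops between pages that link to each other.
    static void CrawlLinks(string url, int depth, HashSet<string> visited)
    {
        if (depth <= 0 || !visited.Add(url))
            return;

        string htmlSource;
        using (var client = new WebClient())
        {
            htmlSource = client.DownloadString(url);
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(htmlSource);

        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes == null)
            return; // page contains no anchors

        foreach (var href in nodes.Select(n => n.Attributes["href"].Value))
        {
            // only follow absolute http(s) links in this sketch
            if (!href.StartsWith("http"))
                continue;

            Console.WriteLine(href);
            CrawlLinks(href, depth - 1, visited);
        }
    }

    static void Main()
    {
        CrawlLinks("http://www.stackoverflow.com", 2, new HashSet<string>());
    }
}

With a depth of 2 this prints the links on the start page and the links on every page it links to; the visited set is what keeps mutually linked pages from causing unbounded recursion.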
Darin Dimitrov
  • thx 4 the answer, I will test it and give Feedback and mark the answer if it works :D – eMi Nov 08 '11 at 07:55