Searching for strings within a string (finding all hrefs in HTML source)

Question

I have a string variable that contains the entire HTML of a web page. The page contains links to other websites, and I would like to build a list of all the hrefs (like a web crawler). What is the best way to do this? Would an extension method help? What about using Regex?

Thanks in Advance

score 3 · Accepted Answer · edited May 23 '17 at 12:26

Use a DOM parser such as the HTML Agility Pack to parse your document and find all links.

There's a good question on SO about how to use HTML Agility Pack available here. Here's a simple example to get you started:

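// Requires the HtmlAgilityPack library and "using System.Linq;".
string html = "your HTML here";

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();

doc.LoadHtml(html);

// Keep every <a> element that has an href attribute and read its value.
var links = doc.DocumentNode.DescendantNodes()
   .Where(n => n.Name == "a" && n.Attributes.Contains("href"))
   .Select(n => n.Attributes["href"].Value);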

edited May 23 '17 at 12:26

Community

answered Jun 17 '11 at 16:22

Donut

@Donut: Thanks for enlightening me about HTML Agility Pack. I had never used it before. I am now exploring it. – Ananth Jun 18 '11 at 05:08

score 1 · Answer 2 · answered Jun 17 '11 at 16:22

I think you'll find this answers your question to a T

http://msdn.microsoft.com/en-us/library/t9e807fx.aspx

:)

answered Jun 17 '11 at 16:22

The Evil Greebo

score 1 · Answer 3 · answered Jun 17 '11 at 16:23

I would go with Regex.

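        // Capture the value of each single- or double-quoted href attribute.
        Regex exp = new Regex(
            @"href\s*=\s*[""']([^""']*)[""']",
            RegexOptions.IgnoreCase);
        string InputText = "your HTML here"; // supply the HTML source
        MatchCollection MatchList = exp.Matches(InputText);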

answered Jun 17 '11 at 16:23

therealmitchconnors

score 1 · Answer 4 · answered Jun 17 '11 at 16:23

Try this Regex (should work):

var matches = Regex.Matches (html, @"href=""(.+?)""");

You can go through the matches and extract the captured URL.
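
Going through the matches might look like the sketch below. It uses the same pattern, so it only sees double-quoted href attributes. The `HrefDemo` class, the `ExtractHrefs` helper and the sample HTML are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefDemo
{
    // Hypothetical helper: collects the text captured by group 1 of every match.
    public static List<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match match in Regex.Matches(html, @"href=""(.+?)"""))
        {
            // Groups[0] is the whole match; Groups[1] is the URL between the quotes.
            urls.Add(match.Groups[1].Value);
        }
        return urls;
    }

    public static void Main()
    {
        // Made-up sample input.
        string html = @"<a href=""http://example.com/a"">A</a> <a href=""/b"">B</a>";
        foreach (string url in ExtractHrefs(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

Note that this pattern skips single-quoted or unquoted href values, which is one reason a DOM parser tends to be more robust on real-world pages.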

answered Jun 17 '11 at 16:23

Tim Rogers

score 1 · Answer 5 · edited May 23 '17 at 10:08

Have you looked into using the HTML Agility Pack? http://htmlagilitypack.codeplex.com/

With it you can simply use XPath to get all of the links on the page and put them into a list.

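private List<string> ExtractAllAHrefTags(HtmlDocument htmlSnippet)
{
    List<string> hrefTags = new List<string>();

    // SelectNodes returns null when nothing matches, so guard before iterating.
    HtmlNodeCollection links = htmlSnippet.DocumentNode.SelectNodes("//a[@href]");
    if (links == null)
    {
        return hrefTags;
    }

    foreach (HtmlNode link in links)
    {
        HtmlAttribute att = link.Attributes["href"];
        hrefTags.Add(att.Value);
    }

    return hrefTags;
}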

Taken from another post here - Get all links on html page?

edited May 23 '17 at 10:08

Community

answered Jun 17 '11 at 16:23

EvanGWatkins

Thanks. I haven't looked into the HTML Agility Pack before, but I am now. – Ananth Jun 18 '11 at 05:10
