How to extract HTML links from an HTML file in C#?

Question

Can anyone explain how to extract URLs/links from an HTML file in C#?

Check out http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack — Austin Salonen, Feb 25 '10 at 17:26

score 11 · Answer 1 · edited May 02 '13 at 00:09

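// Requires the HtmlAgilityPack package and: using HtmlAgilityPack; using System.Collections.Generic;
var yourList = new List<string>();
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        HtmlAttribute att = link.Attributes["href"];
        yourList.Add(att.Value);
    }
}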

edited May 02 '13 at 00:09 by Carlos

answered Feb 25 '10 at 17:24 by Sergey Mirvoda

Do this. Parsing HTML with RegEx can be a very tedious task; Html Agility Pack will save you a lot of time. – Nathan Taylor Feb 25 '10 at 17:35
Agreed, HTML Agility pack is the way to go. – Dan Diplo Feb 26 '10 at 08:38
One up for the Html Agility pack! – thijs Feb 26 '10 at 09:05

ABCD · Answer 2 · 2013-01-03T00:26:29.997

1

Use the Html Agility Pack:

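    // Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
    private List<string> ParseLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectNodes returns null when no <a href> element is found.
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes == null)
            return new List<string>();
        // Take only the href value of each anchor, not every attribute value.
        return nodes.Select(n => n.GetAttributeValue("href", string.Empty)).ToList();
    }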

It works for me.
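
As a usage sketch, the snippet below wraps the same XPath query in a small helper and runs it on an in-memory page. The names LinkExtractor and ExtractLinks and the sample markup are made up for illustration, and the code assumes the HtmlAgilityPack NuGet package is referenced. Filtering out empty values and duplicates is an optional extra, not something the library does for you.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper for illustration; not part of Html Agility Pack.
public static class LinkExtractor
{
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when no node matches the XPath expression.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return new List<string>();
        }

        return anchors
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .Where(href => href.Length > 0) // skip empty href values
            .Distinct()                     // drop repeated links
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        // Sample markup (an assumption): three anchors with href, one without.
        const string page =
            "<html><body>" +
            "<a href='a.html'>first</a>" +
            "<a name='anchor-only'>no href</a>" +
            "<a href='b.html'>second</a>" +
            "<a href='a.html'>first again</a>" +
            "</body></html>";

        // Prints a.html and b.html, one per line.
        foreach (string link in LinkExtractor.ExtractLinks(page))
        {
            Console.WriteLine(link);
        }
    }
}
```

Returning an empty list instead of null keeps callers free of null checks, which is the same choice the ParseLinks method above makes.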

edited Jan 03 '13 at 00:26

answered Jan 03 '13 at 00:18 by ABCD

seagulf · Answer 3 · 2013-07-21T03:05:47.943

You can use the HTQL COM object and query the page with the expression <a>:href

HTQLCOMLib.HtqlControl h = new HTQLCOMLib.HtqlControl();
string page = "<html><body><a href='test1.html'>test1</a><a href='test2.html'>test2</a> </body></html>";
h.setSourceData(page, page.Length);
h.setQuery("<a>: href ");
for (h.moveFirst(); 0 == h.isEOF(); h.moveNext() )
{
     MessageBox.Show(h.getValueByIndex(1));
}

It shows one message box per link:

test1.html

test2.html
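
The values returned here are exactly what the href attributes contain, so they can be relative paths such as test1.html. If absolute URLs are needed, one possible follow-up step is to resolve each value against the address the page was loaded from. The sketch below uses only System.Uri from the .NET base class library. The base address http://example.com/docs/index.html and the names LinkResolver and ToAbsolute are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration.
public static class LinkResolver
{
    // Resolves each href against baseAddress; values that cannot be
    // combined into a valid URI are skipped.
    public static List<string> ToAbsolute(string baseAddress, IEnumerable<string> hrefs)
    {
        var baseUri = new Uri(baseAddress);
        var result = new List<string>();
        foreach (string href in hrefs)
        {
            if (Uri.TryCreate(baseUri, href, out Uri absolute))
            {
                result.Add(absolute.AbsoluteUri);
            }
        }
        return result;
    }
}

public static class ResolverDemo
{
    public static void Main()
    {
        var hrefs = new[] { "test1.html", "test2.html" };
        foreach (string url in LinkResolver.ToAbsolute("http://example.com/docs/index.html", hrefs))
        {
            Console.WriteLine(url);
        }
        // Prints:
        // http://example.com/docs/test1.html
        // http://example.com/docs/test2.html
    }
}
```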
