Can anyone explain how to extract URLs/links from an HTML file in C#?
3 Answers
11
Take a look at Html Agility Pack:
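// Requires the HtmlAgilityPack package (using HtmlAgilityPack;).
// yourList is assumed to be a List<string> declared elsewhere.
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// SelectNodes returns null when nothing matches, so check before iterating.
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
if (nodes != null)
{
    foreach (HtmlNode link in nodes)
    {
        HtmlAttribute att = link.Attributes["href"];
        yourList.Add(att.Value);
    }
}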

Carlos

Sergey Mirvoda
Do this. Parsing HTML with regex can be very tedious; Html Agility Pack will save you a lot of time. – Nathan Taylor Feb 25 '10 at 17:35
1
Use Html Agility Pack:
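// Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
private List<string> ParseLinks(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    // SelectNodes returns null when no <a href> elements are found.
    var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
    if (nodes == null)
        return new List<string>();
    // Take only the href value of each anchor, not every attribute value.
    return nodes.Select(n => n.GetAttributeValue("href", "")).ToList();
}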
It works for me.

ABCD
-2
You can use the HTQL COM object and query the page with the query <a>:href
HTQLCOMLib.HtqlControl h = new HTQLCOMLib.HtqlControl();
string page = "<html><body><a href='test1.html'>test1</a><a href='test2.html'>test2</a> </body></html>";
h.setSourceData(page, page.Length);
h.setQuery("<a>: href ");
for (h.moveFirst(); 0 == h.isEOF(); h.moveNext())
{
    MessageBox.Show(h.getValueByIndex(1));
}
It will show these messages:
test1.html
test2.html

seagulf