This is the code to get the links:
private List<string> getLinks(HtmlAgilityPack.HtmlDocument document)
{
List<string> mainLinks = new List<string>();
var linkNodes = document.DocumentNode.SelectNodes("//a[@href]");
if (linkNodes != null)
{
foreach (HtmlNode link in linkNodes)
{
var href = link.Attributes["href"].Value;
mainLinks.Add(href);
}
}
return mainLinks;
}
Sometimes the links im getting are starting like "/" or:
"/videos?feature=mh" Or "//www.youtube.com/my_videos_upload"
Im not sure if just "/" meaning a proper site or a site that start with "/videoes?... Or "//www.youtube...
I need to get each time the links from a website that start with http or https maybe just www also count as a proper site. The question is what i define as a proper site address and a link and whats not ?
Im sure my getLinks function is not good the code is not the proper way it should be.
This is how im adding the links to the List:
private List<string> test(string url, int levels , DoWorkEventArgs eve)
{
HtmlAgilityPack.HtmlDocument doc;
HtmlWeb hw = new HtmlWeb();
List<string> webSites;// = new List<string>();
List<string> csFiles = new List<string>();
try
{
doc = hw.Load(url);
webSites = getLinks(doc);
webSites is a List After few times i see in the List sites like "/" or as above "//videoes... or "//www....