I'm trying to get all the links from a website and put them in a List, but sometimes I'm getting strange links. Why?

Question

This is the code to get the links:

private List<string> getLinks(HtmlAgilityPack.HtmlDocument document)
{
    List<string> mainLinks = new List<string>();
    var linkNodes = document.DocumentNode.SelectNodes("//a[@href]");
    if (linkNodes != null)
    {
        foreach (HtmlNode link in linkNodes)
        {
            var href = link.Attributes["href"].Value;
            mainLinks.Add(href);
        }
    }
    return mainLinks;
}

Sometimes the links I'm getting start with "/", for example:

"/videos?feature=mh" or "//www.youtube.com/my_videos_upload"

I'm not sure whether a bare "/", a link that starts with "/videos?...", or one that starts with "//www.youtube..." counts as a proper site address.

I need to collect only the links that start with http or https; maybe links that start with just www also count as a proper site. The question is: what should I treat as a proper site address and link, and what should I not?

I'm sure my getLinks function is not written the way it should be.

This is how I'm adding the links to the List:

private List<string> test(string url, int levels, DoWorkEventArgs eve)
{
    HtmlAgilityPack.HtmlDocument doc;
    HtmlWeb hw = new HtmlWeb();
    List<string> webSites; // = new List<string>();
    List<string> csFiles = new List<string>();

    try
    {
        doc = hw.Load(url);
        webSites = getLinks(doc);

webSites is a List. After a few runs I see entries in the List like "/" or, as above, "/videos?..." or "//www....

I need to get a link, but what counts as a link? Is "/" a link? I'm sure a bare "/" is nothing. But is "//www...." a proper link or not? If everything started with "//www", I could just add http before it, but the first link I'm getting in the List is just "/". — Daniel Lip, Sep 13 '12 at 02:01
Check the following possible duplicate questions: http://stackoverflow.com/questions/7578620/anchor-a-link-to-base-url and http://stackoverflow.com/questions/9646407/two-forward-slashes-in-a-url-src-href-attribute — rikitikitik, Sep 13 '12 at 02:05
@rikitikitik: Neither of the two questions you linked are anywhere close to duplicates of this one. Please read both the question and the links again. — Ken White, Sep 13 '12 at 02:06
@KenWhite He asked what the "/" and "//" are in the anchors and those links answered what those are. — rikitikitik, Sep 13 '12 at 02:09
@rikitikitik: I read the question differently. I read it as "Why is my code not returning what I think are proper links? Is it working as it should, or are the results proper links?". The other two just ask to explain URLs, and couldn't reasonably be expected to match this one in a search. — Ken White, Sep 13 '12 at 02:15
@KenWhite I checked his code and it looked reasonable. I felt it was reasonable to just point to him some explanations on what the "weird" links were. I could change "possible duplicate" to "possible related" in my original comment, but it's too late for that now. — rikitikitik, Sep 13 '12 at 02:23
@rikitikitik: If you had said "You might want to see these links", I wouldn't have said anything. Sometimes, though, a comment about "duplicates" can lead to others just voting to close for that reason without checking the links; I didn't want that to happen here, because I don't think duplicates applies. :-) — Ken White, Sep 13 '12 at 02:34

score 0 · Answer 1 · answered Sep 13 '12 at 02:04

I'm not sure I understood your question, but /Videos means the link points to the Videos folder at the root of the host you are accessing.

For example:

www.somesite.com/Videos

RollRoll


score 0 · Answer 2 · answered Sep 13 '12 at 02:06

There are absolute and relative URLs, so different links give you different flavors. You need to convert each one to an absolute URL against the address of the page it came from (the Uri class will mostly handle this for you).

foo/bar.txt - relative URL, resolved against the path of the current page
../foo/bar.txt - relative URL, one folder above the current page
/foo/bar.txt - server-relative path: same server, path starting from the root
//www.sample.com/foo/bar.txt - scheme-relative URL: full host and path, using the same scheme (http/https) as the current page
http://www.sample.com/foo/bar.txt - complete absolute URL
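
As a rough sketch of how the Uri class could handle the forms listed above: the helper below resolves each raw href against the URL of the page it came from and keeps only http/https results. The class name LinkResolver, the method name ToAbsolute, and the pageUrl parameter are hypothetical and not part of HtmlAgilityPack. The sketch assumes you pass in the same url you gave to HtmlWeb.Load.

```csharp
using System;
using System.Collections.Generic;

public static class LinkResolver
{
    // Hypothetical helper (not part of HtmlAgilityPack): resolves raw href
    // values against the page URL and keeps only http/https results.
    public static List<string> ToAbsolute(string pageUrl, IEnumerable<string> hrefs)
    {
        var baseUri = new Uri(pageUrl); // assumption: pageUrl is itself absolute
        var result = new List<string>();

        foreach (string href in hrefs)
        {
            // Resolves "foo", "../foo", "/foo" and "//host/foo" against baseUri;
            // an href that is already absolute is kept as it is.
            if (!Uri.TryCreate(baseUri, href, out Uri resolved))
                continue; // skip values that cannot be parsed

            // Drop non-web links such as mailto: or javascript:.
            if (resolved.Scheme != Uri.UriSchemeHttp &&
                resolved.Scheme != Uri.UriSchemeHttps)
                continue;

            // Different raw forms can resolve to the same address; store it once.
            if (!result.Contains(resolved.AbsoluteUri))
                result.Add(resolved.AbsoluteUri);
        }

        return result;
    }
}
```

In the question's code this could be called as LinkResolver.ToAbsolute(url, getLinks(doc)). With a page URL of https://www.youtube.com/, the raw value "/videos?feature=mh" would resolve to https://www.youtube.com/videos?feature=mh, and a bare "/" would resolve to the site root.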

score 0 · Answer 3 · answered Sep 13 '12 at 02:11

It looks like you are using a library that can parse HTML tags.

As far as I understand,

var href = link.Attributes["href"].Value;

does nothing but read the value of the "href" attribute.

So if the website's source code uses links like href="/news", the relative links will be grabbed and saved to your list as well.

Just view the target website's source code and check it against your results.
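
To make that comparison easier, one option is to label each raw href by its form before deciding what to do with it. The sketch below uses plain string checks only. The class name HrefKind, the method name Describe, and the label strings are made up for illustration and do not come from any library.

```csharp
using System;

public static class HrefKind
{
    // Hypothetical helper: labels a raw href value by its form without
    // resolving it. The label names are illustrative only.
    public static string Describe(string href)
    {
        if (string.IsNullOrWhiteSpace(href)) return "empty";
        href = href.Trim();

        if (href.StartsWith("#", StringComparison.Ordinal)) return "fragment";
        // Check "//" before "/" so scheme-relative links are not mislabeled.
        if (href.StartsWith("//", StringComparison.Ordinal)) return "scheme-relative";
        if (href.StartsWith("/", StringComparison.Ordinal)) return "root-relative";

        // A scheme such as "http:" or "mailto:" ends at the first ':' and
        // must appear before any '/', '?' or '#'.
        int colon = href.IndexOf(':');
        int delimiter = href.IndexOfAny(new[] { '/', '?', '#' });
        if (colon > 0 && (delimiter < 0 || colon < delimiter)) return "absolute";

        return "path-relative";
    }
}
```

Calling HrefKind.Describe on each entry returned by getLinks shows how many of the "strange" values are simply relative links copied verbatim from the page source.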
