Traversing URLs or pages to find 404 links

Question

I have working code that traverses one level of URLs. I need some help implementing two or three levels of link traversal to detect 404s.

    driver().navigate().to(URL);
    driver().manage().window().maximize();
    String orgWindow = driver().getWindowHandle();

    List<WebElement> linksList = driver().findElements(By.tagName("a"));

    for (WebElement linkElement : linksList) {

        System.out.println("================ At First Level =================");

        String link = linkElement.getAttribute("href");
        if (link != null && link.contains("test")) {

            verifyLinkActive(link); //This method has HTTP URL connection to detect for 404's

            // Second Level Traversing.....
            driver().navigate().to(link);
            driver().manage().window().maximize();

            List<WebElement> SecondLinkList = driver().findElements(By.tagName("a"));

            for (WebElement linkSecondElement : SecondLinkList) {

                System.out.println("================ At Second Level =================");

                String Secondlink = linkSecondElement.getAttribute("href");
                if (Secondlink != null && Secondlink.contains("test")) {

                    verifyLinkActive(Secondlink);

                }// SecondIF

            }//Second for


        }//if

        driver().switchTo().window(orgWindow);  //Switching back to Original window


    } //for

My questions: 1) Is this the right way to implement a second or third level of iteration to find 404s? 2) Is there a way to ignore links that fall within specific tags or IDs? These standard links are repeated on every page, so I would like to skip them if possible.

Looking forward to some input!

score 0 · Answer 1 · answered Sep 04 '14 at 17:16

If you mean how to structure the program itself, maybe the easiest way is to keep a list of URLs to check (to-check-urls) and a set of already checked URLs (checked-urls).

When your program starts, to-check-urls contains only the first page to visit, and checked-urls is empty.

Then you have a single loop that repeats until the list of URLs to check is empty, and does this:

1. If the list is empty, exit; you are finished.
2. Take one URL from to-check-urls and remove it.
3. If the URL is already present in checked-urls, return to 1.
4. Add the URL to checked-urls.
5. Open the URL as you already do.
6. If it's a 404, report the error as you prefer and return to 1.
7. Parse the HTML as you already do.
8. Put all the found URLs in to-check-urls.
9. Return to 1.

The code is mostly there; you just need to arrange it in a loop using the two collections. This way you never check a URL twice, and you don't need to care whether a link is at the second, third, or fourth level. A site is a graph, not a tree, so no matter how many levels you hard-code there could always be more.
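The loop described in this answer can be sketched roughly as follows. This is only a minimal illustration, not the original poster's code: the class name `LinkCrawler` and the two parameters `isBroken` and `extractLinks` are hypothetical stand-ins. `isBroken` would wrap something like the existing `verifyLinkActive` check, and `extractLinks` would navigate with the driver and return every `href` found on the page (applying the `contains("test")` filter there if you want to stay on your own site).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

/** Worklist crawl: every URL is checked at most once, whatever its depth. */
final class LinkCrawler {

    /**
     * @param startUrl     the first page to visit
     * @param isBroken     hypothetical check, e.g. an HTTP request that returns true on a 404
     * @param extractLinks hypothetical page parser returning the hrefs found on a page
     * @return the broken URLs, in the order they were discovered
     */
    static List<String> findBrokenLinks(String startUrl,
                                        Predicate<String> isBroken,
                                        Function<String, List<String>> extractLinks) {
        Deque<String> toCheck = new ArrayDeque<>(); // to-check-urls
        Set<String> checked = new HashSet<>();      // checked-urls
        List<String> broken = new ArrayList<>();
        toCheck.add(startUrl);

        while (!toCheck.isEmpty()) {                // step 1: stop when nothing is left
            String url = toCheck.poll();            // step 2: take one URL and remove it
            if (!checked.add(url)) {                // steps 3-4: skip if seen, else mark as seen
                continue;
            }
            if (isBroken.test(url)) {               // steps 5-6: report 404s, do not parse them
                broken.add(url);
                continue;
            }
            for (String link : extractLinks.apply(url)) { // steps 7-8: queue the found links
                if (link != null && !checked.contains(link)) {
                    toCheck.add(link);
                }
            }
        }
        return broken;
    }
}
```

Because the page access is passed in as two functions, the same loop can be tried against an in-memory map of pages before wiring it to a real browser session.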

How would I ignore links between specific tags, e.g. ignore all links falling within
tags? — QualityThoughts, Sep 04 '14 at 18:54
Instead of doing By.tagName("a") try By.xpath("body//a") or By.cssSelector("body a") — Simone Gianni, Sep 05 '14 at 14:25
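To skip links inside particular containers rather than just restricting the search to the body, one option is an XPath that excludes anchors by ancestor. The sketch below uses only the JDK's XPath engine on a small XHTML string so it can be run without a browser; the ids `nav` and `footer` and the class name `LinkFilter` are made up for illustration. The same expression string could then be tried with `By.xpath(...)`. Note that an XPath evaluated from the document root is normally written with a leading `//` (for example `//body//a`) so that it matches anywhere in the page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class LinkFilter {

    /** Anchors with an href and no ancestor whose id is "nav" or "footer" (hypothetical ids). */
    static final String CONTENT_LINKS =
        "//a[@href][not(ancestor::*[@id='nav' or @id='footer'])]";

    /** Evaluates the XPath against well-formed XHTML and returns the matched href values. */
    static List<String> hrefs(String xhtml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(((Element) nodes.item(i)).getAttribute("href"));
            }
            return result;
        } catch (Exception e) {
            // Parsing or XPath failure: surface it as an unchecked exception.
            throw new IllegalArgumentException("Could not evaluate XPath: " + xpath, e);
        }
    }
}
```

Since the checked-urls set already prevents a repeated link from being verified twice, this kind of filter mainly saves the time spent collecting the same navigation and footer links on every page.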
