I have working code that traverses one level of links from a URL. I need some help implementing two or three levels of link traversal to detect 404s.

    driver().navigate().to(URL);
    driver().manage().window().maximize();
    String orgWindow = driver().getWindowHandle();

    List<WebElement> linksList = driver().findElements(By.tagName("a"));

    for (WebElement linkElement : linksList) {

        System.out.println("================ At First Level =================");

        String link = linkElement.getAttribute("href");
        if (link != null && link.contains("test")) {

            verifyLinkActive(link); //This method has HTTP URL connection to detect for 404's

            // Second Level Traversing.....
            driver().navigate().to(link);
            driver().manage().window().maximize();

            List<WebElement> SecondLinkList = driver().findElements(By.tagName("a"));

            for (WebElement linkSecondElement : SecondLinkList) {

                System.out.println("================ At Second Level =================");

                String Secondlink = linkSecondElement.getAttribute("href");
                if (Secondlink != null && Secondlink.contains("test")) {

                    verifyLinkActive(Secondlink);

                }// SecondIF

            }//Second for


        }//if

        driver().switchTo().window(orgWindow);  //Switching back to Original window


    } //for

My questions:

  1. Is this the right way to implement a second or third level of iteration to find 404s?
  2. Is there a way to ignore certain links that fall under specific tags or IDs? These standard links are repeated on every page, and I would like to skip them if possible.

Looking forward to some input!

QualityThoughts

1 Answer

If you mean how to structure the program itself, maybe the easiest way is to keep a list of URLs still to check (to-check-urls) and a set of URLs already checked (checked-urls).

When your program starts, to-check-urls contains only the first page to visit, and checked-urls is empty.

Then you have a single loop that repeats until the list of URLs to check is empty, and does this:

  1. If to-check-urls is empty, exit: you are finished
  2. Take one URL from to-check-urls and remove it from the list
  3. If the URL is already present in checked-urls, go back to step 1
  4. Add the URL to checked-urls
  5. Open the URL as you already do
  6. If it returns a 404, report the error however you prefer and go back to step 1
  7. Parse the HTML as you already do
  8. Put all the URLs found on the page into to-check-urls
  9. Go back to step 1
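The steps above can be sketched in Java roughly as follows. This is only a minimal sketch, not a definitive implementation: the class `LinkCrawler`, the `Result` holder and the `fetchLinks` hook are hypothetical names introduced for illustration. The hook is assumed to return the hrefs found on a page, or `null` when the page is a 404, so the loop can be shown without a browser. In `main` it is backed by a small in-memory map standing in for a site.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class LinkCrawler {

    /** Hypothetical holder: every URL that was checked, and those reported as 404. */
    public static final class Result {
        public final Set<String> checked = new LinkedHashSet<>();
        public final List<String> broken = new ArrayList<>();
    }

    /**
     * @param startUrl   the first page to visit
     * @param fetchLinks hypothetical hook (an assumption, not a Selenium API):
     *                   returns the non-null hrefs found on the page,
     *                   or null if the page is a 404
     */
    public static Result crawl(String startUrl, Function<String, List<String>> fetchLinks) {
        Result result = new Result();
        Deque<String> toCheck = new ArrayDeque<>();   // to-check-urls
        toCheck.add(startUrl);

        while (!toCheck.isEmpty()) {                  // step 1
            String url = toCheck.poll();              // step 2
            // Steps 3 and 4: add() returns false if the URL was already checked.
            if (!result.checked.add(url)) {
                continue;
            }
            List<String> links = fetchLinks.apply(url);   // steps 5 and 7
            if (links == null) {                      // step 6
                result.broken.add(url);
                continue;
            }
            toCheck.addAll(links);                    // step 8
        }                                             // step 9: loop again
        return result;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a site; "/missing" has no entry, so it acts as a 404.
        Map<String, List<String>> site = Map.of(
                "/home", List.of("/a", "/b"),
                "/a", List.of("/home", "/b", "/missing"),
                "/b", List.of("/a"));

        Result r = crawl("/home", site::get);
        System.out.println("checked: " + r.checked);  // checked: [/home, /a, /b, /missing]
        System.out.println("broken:  " + r.broken);   // broken:  [/missing]
    }
}
```

In your setup the hook would presumably wrap `verifyLinkActive` and `driver().findElements(By.tagName("a"))`. It is probably safest to read every `href` into a plain string before navigating to the next page, since element references taken from one page generally cannot be used after the driver has moved to another. The hook is also a natural place to apply your `contains("test")` filter, or to skip the repeated links you want to ignore.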

The code is mostly there; it just needs to be arranged in a loop around these two collections. This way you never check a URL twice, and you do not need to care whether a link is at the second, third, or fourth level. That matters because a site is a graph, not a tree: no matter how many nested levels you hard-code, there could still be more.

Simone Gianni