
I'm new to Selenium and I would like to download all the pdf, ppt(x) and doc(x) files from a website. I have written the following code, but I'm confused about how to get the inner links:

import java.io.*;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.io.FileUtils;
import org.openqa.selenium.By;
import org.openqa.selenium.OutputType;
import org.openqa.selenium.TakesScreenshot;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class WebScraper {

    String loginPage = "https://blablah/login";
    static String userName = "11";
    static String password = "11";
    static String mainPage = "https://blahblah";

    public WebDriver driver = new FirefoxDriver();
    ArrayList<String> visitedLinks = new ArrayList<>();

    public static void main(String[] args) throws IOException {

        System.setProperty("webdriver.gecko.driver", "E:\\geckodriver.exe");

        WebScraper webScraper = new WebScraper();
        webScraper.openTestSite();
        webScraper.login(userName, password);

        webScraper.getText(mainPage);
        webScraper.saveScreenshot();
        webScraper.closeBrowser();
    }

    /**
     * Open the test website.
     */
    public void openTestSite() {

        driver.navigate().to(loginPage);
    }

    /**
     * Logs into the website by entering the provided username and password.
     *
     * @param username
     * @param password
     */
    public void login(String username, String password) {

        WebElement userName_editbox = driver.findElement(By.id("IDToken1"));
        WebElement password_editbox = driver.findElement(By.id("IDToken2"));
        WebElement submit_button = driver.findElement(By.name("Login.Submit"));

        userName_editbox.sendKeys(username);
        password_editbox.sendKeys(password);
        submit_button.click();
    }

    /**
     * Navigates to the given page and follows its links, fetching any
     * downloadable files it finds.
     *
     * @throws IOException
     */
    public void getText(String website) throws IOException {

        driver.navigate().to(website);

        try {
            Thread.sleep(10000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        List<WebElement> allLinks = driver.findElements(By.tagName("a"));

        System.out.println("Total no of links Available: " + allLinks.size());

        for (int i = 0; i < allLinks.size(); i++) {

            String fileAddress = allLinks.get(i).getAttribute("href");

            System.out.println(fileAddress);
            // guard against anchors without an href attribute
            if (fileAddress != null && fileAddress.contains("download")) {
                driver.get(fileAddress);
            } else {
//                getText(allLinks.get(i).getAttribute("href"));
            }
        }
        
    }

    /**
     * Saves the screenshot
     *
     * @throws IOException
     */
    public void saveScreenshot() throws IOException {
        File scrFile = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
        FileUtils.copyFile(scrFile, new File("screenshot.png"));
    }

    public void closeBrowser() {
        driver.close();
    }
    
}

I have an if clause which checks whether the current link is a downloadable file (i.e. an address containing the word "download"). If it is, I download it; if not, what should I do? That part is my problem. I tried to implement a recursive function to retrieve the nested links and repeat the steps for them, but without success.

In the meantime, the first link found when giving https://blahblah as the input is https://blahblah/#, which refers to the same page as https://blahblah. That could also cause a problem, but currently I'm trapped in another one, namely the implementation of the recursive function. Could you please help me?
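For reference, the recursion I have in mind can be sketched without a browser at all. In this sketch, `LinkSource` is a hypothetical interface standing in for "fetch the page and return all its `href` values" (in the real code that would wrap `driver.findElements(By.tagName("a"))`), and `downloads.add(href)` stands in for `driver.get(href)`. The `visited` list is what stops the crawl from revisiting a page and looping forever:

```java
import java.util.*;

public class CrawlSketch {

    // Hypothetical stand-in for "fetch the page and return all href values";
    // in the real code this would wrap driver.findElements(By.tagName("a")).
    interface LinkSource {
        List<String> linksOf(String page);
    }

    static void crawl(String page, LinkSource source,
                      List<String> visited, List<String> downloads) {
        if (visited.contains(page)) {
            return; // stop revisiting pages, which would otherwise loop forever
        }
        visited.add(page);
        for (String href : source.linksOf(page)) {
            if (href == null) {
                continue; // anchors without an href attribute
            }
            if (href.contains("download")) {
                downloads.add(href); // in the real code: driver.get(href)
            } else {
                crawl(href, source, visited, downloads); // recurse into inner page
            }
        }
    }
}
```

With the real driver plugged in, normalizing URLs (e.g. stripping a trailing `#`) before the `visited` check would also handle the `https://blahblah/#` duplicate mentioned above.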

user1419243
  • Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. Avoid asking multiple distinct questions at once. See the [How to Ask](https://stackoverflow.com/help/how-to-ask) page for help clarifying this question. – undetected Selenium Nov 07 '17 at 14:30
  • @DebanjanB Sorry, I haven't asked multiple questions. My question is simple and straightforward: How can I iterate through all links of a website using Selenium? I don't understand what other question you see in my post. Could you please explain? – user1419243 Nov 07 '17 at 14:41
  • @user1419243 just try removing `else` statement, so just give `for (int i = 0; i < allLinks.size(); i++) { String fileAddress = allLinks.get(i).getAttribute("href"); System.out.println(allLinks.get(i).getAttribute("href")); if (fileAddress.contains("download")) { driver.get(fileAddress); } }` – user1207289 Nov 07 '17 at 14:56
  • @DebanjanB Sorry again, I think you haven't read my code. I AM already downloading the pdf files, and in the text that you copied, I'm just "explaining" my code. I'm not asking anything. – user1419243 Nov 07 '17 at 14:58
  • @user1207289 Thanks for your answer. The problem with this approach is that it tried to download the files "only" on the first page and doesn't search in the inner links. And I actually don't have any file in the main page. So, I need to crawl much deeper. – user1419243 Nov 07 '17 at 14:59
  • So what do you want to do if `fileAddress` does not contain the word "download" ? it wasn't clear in your question – user1207289 Nov 07 '17 at 15:06
  • Do you mean `allLinks = driver.findElements(By.tagName("a"));` does not fetch all links in the page? – user1207289 Nov 07 '17 at 15:16
  • @user1207289 If the fileAddress doesn't include the word "download", it means that it is a normal link. So, I need to click on that and do the same process (check all the links again and see if there is something which is downloadable). allLinks fetches all the links of the "current" page, yes. But not the links of the inner pages. – user1419243 Nov 07 '17 at 19:13
  • @user1419243 have a look at my answer below. It can be modified to look for word "download" and then click on links that do not contain that. One way is to put all links that do not have "download" in a `collection` and iterate through recursion. – user1207289 Nov 07 '17 at 19:48

2 Answers


You are not far off. Answering your question: grab all the links into a list of elements, then iterate, click, and wait. In C#, something like this:

    IList<IWebElement> listOfLinks = _driver.FindElements(By.XPath("//a"));
    foreach (var link in listOfLinks)
    {
        if (link.GetAttribute("href").Contains("download"))
        {
            link.Click();
            WaitForSecs(); // Thread.Sleep(1000)
        }
    }

Java

    List<WebElement> listOfLinks = webDriver.findElements(By.xpath("//a"));
    for (WebElement link : listOfLinks) {
        if (link.getAttribute("href").contains("download")) {
            link.click();
            // WaitForSecs(); // Thread.Sleep(1000)
        }
    }
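One caveat worth flagging (an assumption on my part, not something the answer states): if `link.click()` navigates away from the page, the remaining `WebElement`s in the list can become stale and throw `StaleElementReferenceException`. A common workaround is to copy the `href` strings out first and only then navigate. The filtering step is then pure and can be sketched on its own; `filterDownloads` is a hypothetical helper name:

```java
import java.util.*;

public class DownloadLinks {

    // Pure step: given every href value scraped from a page, keep the
    // downloadable ones. Working on plain strings (instead of clicking
    // live WebElements) sidesteps staleness after the first navigation.
    static List<String> filterDownloads(List<String> hrefs) {
        List<String> out = new ArrayList<>();
        for (String href : hrefs) {
            if (href != null && href.contains("download")) { // anchors can lack href
                out.add(href);
            }
        }
        return out;
    }
}
```

In the Selenium loop you would first collect `link.getAttribute("href")` for every anchor into a `List<String>`, then call `driver.get(href)` for each entry `filterDownloads` returns.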
aolisa

One option is to embed Groovy in your Java code if you want to search depth-first. When HTTPBuilder parses a page, it gives you an XML-like document that you can traverse as deep as you like using GPath in Groovy. Your test.groovy would look like this:

@Grab(group='org.codehaus.groovy.modules.http-builder', module='http-builder', version='0.7' )

import groovyx.net.http.HTTPBuilder
import static groovyx.net.http.Method.GET
import static groovyx.net.http.ContentType.JSON
import groovy.json.*
import org.cyberneko.html.parsers.SAXParser
import groovy.util.XmlSlurper
import groovy.json.JsonSlurper

urlValue="http://yoururl.com"

def http = new HTTPBuilder(urlValue) 

// parses the page and provides an XML tree; it even handles malformed HTML
def parsedText = http.get([:])

// number of <a> tags; "**" traverses depth-first
aCount = parsedText."**".findAll { it.name() == 'a' }.size()

Then you just call test.groovy from Java like this:

    static void runWithGroovyShell() throws Exception {
        new GroovyShell().parse(new File("test.groovy")).invokeMethod("hello_world", null);
    }

More info on parsing HTML with Groovy

Addition: when you evaluate Groovy within Java and want to access Groovy variables in the Java environment through Groovy bindings, have a look here

user1207289