I need to write code that collects all the links on a website recursively. I'm new to this; here is what I've got so far:

List<WebElement> no = driver.findElements(By.tagName("a"));
nooflinks = no.size();
for (WebElement pagelink : no)
{
    String linktext = pagelink.getText();
    link = pagelink.getAttribute("href"); 
}

What I need now: whenever the loop finds a link in the same domain, it should collect all the links from that URL, then return to the previous loop and resume from the next link. This should continue until the last URL on the whole website has been found.

For example, say the home page is the base URL and it links to 5 other pages. After getting the first of the 5 URLs, the loop should collect all the links on that first page, return to the home page, and resume from the second URL. If the second page has sub-pages of its own, the loop should collect their links first, then finish the second URL, go back to the home page, and resume from the third URL.
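
The order described above is a depth-first traversal, which maps naturally onto a recursive method. Here is a minimal sketch of that shape, not a finished crawler. The class name, the `linksOf` function (a stand-in for whatever fetches the links on one page) and the `site.test` URLs are all made up for illustration, and "same site" is simplified to "starts with the base URL".

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class RecursiveCrawlSketch {
    /**
     * Visits url, then recurses into each not-yet-seen link under baseUrl.
     * linksOf is a hypothetical stand-in for the page fetcher.
     */
    static void crawl(String url, String baseUrl,
                      Function<String, List<String>> linksOf, Set<String> visited) {
        // add() returns false when the URL was already visited, which stops cycles
        if (!visited.add(url)) {
            return;
        }
        for (String link : linksOf.apply(url)) {
            // Simplified same-site test: the link must start with the base URL
            if (link != null && link.startsWith(baseUrl)) {
                // Finish this link's whole subtree before moving to the next link
                crawl(link, baseUrl, linksOf, visited);
            }
        }
    }

    public static void main(String[] args) {
        // Made-up in-memory site standing in for real pages
        Map<String, List<String>> site = Map.of(
            "http://site.test/", List.of("http://site.test/a", "http://site.test/b"),
            "http://site.test/a", List.of("http://site.test/a/1", "http://site.test/"),
            "http://site.test/b", List.of("http://elsewhere.test/"));
        Set<String> visited = new LinkedHashSet<>();
        crawl("http://site.test/", "http://site.test/",
              u -> site.getOrDefault(u, List.of()), visited);
        System.out.println(visited);
    }
}
```

With Selenium, `linksOf` would probably need to copy every `href` into a list of strings before navigating anywhere, since `WebElement` references generally go stale once the driver loads another page.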

Can anybody help me out here???

Karl Knechtel
IAmMilinPatel
  • http://stackoverflow.com/questions/5913613/standard-java-class-for-common-url-uri-manipulation has some information about manipulating URLs, which may help if you're trying to figure out whether a link is in the same domain. No guarantees; I haven't looked into it any further. – ajb Jul 05 '14 at 20:13
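
To make the same-domain check from the comment above concrete, here is a minimal sketch using `java.net.URI` from the standard library. The class and method names are made up for illustration, and it assumes "same domain" means "exactly the same host", so `www.example.com` and `example.com` would count as different.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class SameHostCheck {
    /** Returns true when both URLs parse and share the same host, ignoring case. */
    static boolean sameHost(String baseUrl, String candidateUrl) {
        // getAttribute("href") can return null for anchors without an href
        if (baseUrl == null || candidateUrl == null) {
            return false;
        }
        try {
            String baseHost = new URI(baseUrl).getHost();
            String candidateHost = new URI(candidateUrl).getHost();
            // getHost() is null for relative or opaque URIs such as "mailto:..."
            return baseHost != null && baseHost.equalsIgnoreCase(candidateHost);
        } catch (URISyntaxException e) {
            // Treat hrefs that do not parse as belonging to another site
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(sameHost("https://example.com/", "https://example.com/about"));
        System.out.println(sameHost("https://example.com/", "https://other.org/"));
    }
}
```

In the loop from the question, this could be called as `sameHost(baseUrl, link)` before deciding whether to follow `link`.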

3 Answers

2

I saw this post recently. I don't know whether you are still looking for a solution to this problem, but in case you are, this might be useful:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class URLReading {
    public static void main(String[] args) {
        String root = "https://abidsukumaran.wordpress.com/";
        // Maps every discovered URL to the text of the link that led to it
        Map<String, String> seen = new HashMap<>();
        // URLs in discovery order; new ones are appended while the loop runs
        List<String> urlArray = new ArrayList<>();

        urlArray.add(root);
        seen.put(root, root); // the start page has no link text, so store its URL

        // Index-based loop, because the list grows while we walk it
        for (int i = 0; i < urlArray.size(); i++) {
            String url = urlArray.get(i);
            try {
                Document doc = Jsoup.connect(url).get();
                // Links in page
                Elements links = doc.select("a[href]");
                for (Element link : links) {
                    // abs:href resolves relative links against the page URL
                    String href = link.attr("abs:href");
                    // Stay on the site, as the question asks: skip links outside the root
                    if (!href.startsWith(root)) {
                        continue;
                    }
                    // putIfAbsent returns null only the first time a URL is seen
                    if (seen.putIfAbsent(href, link.text()) == null) {
                        urlArray.add(href);
                        System.out.println("\nURL: " + href);
                        System.out.println("CONTENT: " + link.text());
                    }
                }
            } catch (Exception e) {
                // A page that fails to load is reported and skipped
                System.out.println("\n" + e);
            }
        }
    }
}
0

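You can use a Set (for example a HashSet) so that duplicate links are dropped automatically. You could try something like this:

// Collects the given links plus everything reachable from them, at most 5 levels deep.
// fetchLinks(String url) is a placeholder for whatever returns the links on one page;
// the original snippet left that part blank.
Set<String> getLinksFromSite(int level, Set<String> links) {
    if (level >= 5) {
        return links;
    }
    // Start from the links of this level so they are part of the result
    Set<String> localLinks = new HashSet<String>(links);
    for (String link : links) {
        Set<String> newLinks = fetchLinks(link);
        localLinks.addAll(getLinksFromSite(level + 1, newLinks));
    }
    return localLinks;
}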
Rahul Tripathi
  • Hello R.T., do I place this code within the existing for loop in my code, or should I replace my code with this altogether? – IAmMilinPatel Jul 06 '14 at 10:02
0

I would think the following idiom would be useful in this context:

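Set<String> visited = new HashSet<>();
Deque<String> unvisited = new LinkedList<>();

// Mark URLs as visited when they are queued, so nothing is queued twice
visited.add(startingURL);
unvisited.add(startingURL);
while (!unvisited.isEmpty()) {
    String current = unvisited.poll();
    // linksOn(current) is a placeholder for whatever returns the URLs found on that page
    for (String link : linksOn(current)) {
        // Set.add returns false if the URL was already seen
        if (visited.add(link)) {
            unvisited.add(link);
        }
    }
}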
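
To try the worklist idiom above without a browser or network, here is a self-contained sketch that runs it over an in-memory link map. The class name, the `linksOf` function and the single-letter "URLs" are made up for illustration. It also marks a URL as seen at the moment it is queued, which is one way to keep the same URL from being queued twice.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class WorklistCrawlSketch {
    /** Returns every URL reachable from startingURL, in the order it was discovered. */
    static Set<String> crawl(String startingURL, Function<String, List<String>> linksOf) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> unvisited = new ArrayDeque<>();
        visited.add(startingURL);
        unvisited.add(startingURL);
        while (!unvisited.isEmpty()) {
            String current = unvisited.poll();
            for (String link : linksOf.apply(current)) {
                // add() is false for URLs seen before, so each URL is queued once
                if (link != null && visited.add(link)) {
                    unvisited.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Made-up link graph with a cycle between "a" and "b"
        Map<String, List<String>> graph = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("a", "d"),
            "c", List.of("d"));
        System.out.println(crawl("a", u -> graph.getOrDefault(u, List.of())));
    }
}
```

Because `poll()` takes from the front of the queue, pages are processed level by level. Swapping it for `pollLast()` should make the loop follow the most recently found link first, which is closer to the depth-first order the question describes.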
sprinter