I am developing a web crawler, but I am stuck: it does not collect all the reachable links. Here is my code:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.regex.Pattern;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class SNCrawler extends Thread {
        Specific s;
        HashSet<String> hs = new HashSet<String>();

        public SNCrawler(Specific s)
        {
            this.s = s;
        }

        public void crawl(String url) throws IOException {
            Document doc = Jsoup.connect(url).get();
            Elements links = doc.select("a");
            for (Element link : links)
            {
                if(isSuitable(link.attr("href")) && !hs.contains(link.attr("abs:href")))
                {
                    hs.add(link.attr("href"));
                    crawl(link.attr("href"));
                }
            }
        }

        public boolean isSuitable(String site)
        {
            boolean myBool = false;
            if(site.startsWith("http://www.svensktnaringsliv.se/") && !SNFilter.matcher(site).matches())
                if(site.contains(".pdf")) {
                    hs.add(site);
                    myBool=true;
                }else{
                    hs.add(site);
                    myBool=true;
                }
            return myBool;
        }

        private static final Pattern SNFilter = Pattern.compile(".*((/staff/|medarbetare|play|/member_organizations/|/sme_committee/|rm=print|/contact/|/brussels-office/|/about-us|/newsletter/|/advantagesweden/|service=print|#)).*");

        @Override
        public void run()
        {
            try {
                crawl("http://www.svensktnaringsliv.se/english/");
                for(String myS : hs)
                {
                    System.out.println(myS);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

When the program reaches this part of the website, it does not get any links from there. The same thing happens for this page: from there I get only 2 or 3 links. I have looked at the code for many hours but can't figure out why it gets stuck.
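
For comparison, below is a minimal standalone sketch of the behavior the crawler is meant to have. It does not use Jsoup, and `LinkCheck`, `resolve`, and `markIfNew` are hypothetical names that are not part of the crawler above. Each `href` is resolved against the URL of the page it was found on, and a URL counts as new only the first time it is added to the visited set. The sketch assumes every `href` is a syntactically valid URI.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of the crawler above. It only illustrates
// resolving hrefs against the page URL and a check-and-add visited set.
final class LinkCheck {
    static final String PREFIX = "http://www.svensktnaringsliv.se/";

    // Resolve a raw href (possibly relative) against the page it was found on.
    // Assumption: href is a valid URI; URI.create throws otherwise.
    public static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    // Returns true only the first time a URL under PREFIX is seen.
    // Set.add reports whether the element was newly inserted, so the
    // membership check and the insertion happen in a single step.
    public static boolean markIfNew(Set<String> visited, String absoluteUrl) {
        return absoluteUrl.startsWith(PREFIX) && visited.add(absoluteUrl);
    }

    public static void main(String[] args) {
        Set<String> visited = new LinkedHashSet<>();
        String page = "http://www.svensktnaringsliv.se/english/";
        String[] hrefs = {
            "/english/about/",                                // root-relative
            "news.html",                                      // page-relative
            "http://www.svensktnaringsliv.se/english/about/"  // absolute duplicate
        };
        for (String href : hrefs) {
            String abs = resolve(page, href);
            System.out.println(href + " -> " + abs + " new=" + markIfNew(visited, abs));
        }
    }
}
```

In Jsoup, `link.attr("abs:href")` returns the resolved absolute URL, while `link.attr("href")` returns the attribute exactly as written in the page, so a relative link would not pass a `startsWith("http://www.svensktnaringsliv.se/")` check.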