Web Crawler using jsoup

Question

I am developing a web crawler but I got stuck, because I cannot get all the reachable links, here is my code:

public class SNCrawler extends Thread {

    Specific s;

    HashSet<String> hs = new HashSet<String>();
    public SNCrawler(Specific s)
    {
        this.s = s;
    }

    public void crawl(String url) throws IOException {

        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a");

        for (Element link : links)
        {
            if(isSuitable(link.attr("href")) && !hs.contains(link.attr("abs:href")))
            {
                hs.add(link.attr("href"));
                crawl(link.attr("href"));

            }
        }

    }

    public boolean isSuitable(String site)
    {
        boolean myBool = false;
        if(site.startsWith("http://www.svensktnaringsliv.se/") && !SNFilter.matcher(site).matches())
            if(site.contains(".pdf")) {
                hs.add(site);
                myBool=true;
            }else{
                hs.add(site);
                myBool=true;
            }
        return myBool;

    }

    private static final Pattern SNFilter = Pattern.compile(".*((/staff/|medarbetare|play|/member_organizations/|/sme_committee/|rm=print|/contact/|/brussels-office/|/about-us|/newsletter/|/advantagesweden/|service=print|#)).*");

    @Override
    public void run()
    {
        try {
            crawl("http://www.svensktnaringsliv.se/english/");
            for(String myS : hs)
            {
                System.out.println(myS);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

when the program reaches this part of the website it doesn get any links from there, is the same things for this page, from there I get only 2 or 3 links, I have looked at the code for many hours but cant really figute it out why I got stuck

I haven't used jsoup before, so can't really give you the answer I'm afraid, but was wondering why you don't just use something like regex and phps `preg_match_all` function. `/ — Alexander Holman, Apr 20 '16 at 13:00
sorry this is a more accurate pattern (or so I have found) `/]*href=[\"||\']([^\"\']*)[\"||\'][^>]*>/ixgU` — Alexander Holman, Apr 20 '16 at 13:32

score 1 · Accepted Answer · answered Apr 21 '16 at 11:48

when the program reaches this part of the website it doesn get any links from there

The crawl function should work with absolute urls only. Try the function below instead:

public void crawl(String url) throws IOException {
    Document doc = Jsoup.connect(url).get();
    Elements links = doc.select("a");

    for (Element link : links) {
        String foundUrl = link.attr("abs:href").toLowerCase();

        if( isSuitable(foundUrl) && ( !hs.contains(foundUrl) ) ) {
            hs.add(foundUrl);
            crawl(foundUrl);
        }
    }
}

Web Crawler using jsoup

1 Answers1