1

I have the code below, which crawls a website using JSoup, but I want to crawl multiple URLs at the same time. I stored the URLs in an array, but I can't get it to work. How can this code be implemented with multithreading, if I want to use it? Is multithreading a good fit for such an application?

public class Webcrawler {

    /**
     * Fetches each URL in the list with jsoup and prints every anchor
     * ({@code <a>}) link found on the page.
     *
     * @param args unused command-line arguments
     * @throws IOException if fetching any of the pages fails
     */
    public static void main(String[] args) throws IOException {

        String[] urls = {"http://www.dmoz.org/", "https://docs.oracle.com/en/"};

        System.out.println("Sites to be crawled:");
        for (String url : urls) {
            System.out.println(" " + url);
        }

        // BUG FIX: the original only crawled urls[0]; loop over every URL.
        for (String url : urls) {
            // BUG FIX: the original passed the whole array to %s,
            // printing its toString() instead of an address.
            print("\nFetching %s...", url);

            Document doc = Jsoup.connect(url).get();
            org.jsoup.select.Elements links = doc.select("a");
            print("\nLinks: (%d)", links.size());
            for (Element link : links) {
                // BUG FIX: the original format had one %s for two arguments,
                // so the trimmed link text was silently dropped.
                print(" (%s) (%s)", link.absUrl("href"), trim(link.text(), 35));
            }
        }
    }

    /** printf-style convenience wrapper around System.out.println. */
    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    /**
     * Truncates {@code s} to at most {@code width} characters,
     * appending "." when it was cut.
     */
    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width - 1) + ".";
        else
            return s;
    }
}
user2340612
  • 10,053
  • 4
  • 41
  • 66
Chihab Adam
  • 99
  • 1
  • 6
  • I found this very helpful. Please take some time to go through it: http://mrbool.com/how-to-create-a-web-crawler-and-storing-data-using-java/28925 – SmashCode Mar 11 '16 at 12:03
  • "I can't get it to work" is not a problem description. What *precisely* was your problem? – Raedwald Mar 11 '16 at 13:48
  • @Raedwald, my problem is that I want to create a web crawler that can crawl multiple websites at the same time; right now it only crawls a single site/URL. I want to store the URLs in an array and crawl them. – Chihab Adam Mar 11 '16 at 23:37

1 Answer

5

You can both use multi-threading and crawl multiple websites at the same time. The following code does what you need. I'm pretty sure that it can be improved a lot (e.g. by using an Executor), but I wrote it very quickly.

public class Main {

    /**
     * Crawls several sites concurrently: spawns one worker thread per URL,
     * then collects each worker's result and prints its link count.
     */
    public static void main(String[] args) {

        String[] sites = new String[]{"http://www.dmoz.org/", "http://www.dmoz.org/Computers/Computer_Science/", "https://docs.oracle.com/en/"};

        List<Worker> crawlers = new ArrayList<>(sites.length);

        // Spawn one crawler thread per site.
        for (String site : sites) {
            Worker crawler = new Worker(site);
            crawlers.add(crawler);
            new Thread(crawler).start();
        }

        // Block on each worker in turn and report its outcome.
        for (Worker crawler : crawlers) {
            Elements links = crawler.waitForResults();
            if (links == null) {
                System.err.println(crawler.getName() + " had some error!");
            } else {
                System.out.println(crawler.getName() + ": " + links.size());
            }
        }
    }
}

/**
 * Runnable that fetches a single URL with jsoup and exposes the scraped
 * anchor elements to the spawning thread via {@link #waitForResults()}.
 */
class Worker implements Runnable {

    private String url;
    private Elements results;
    // True once run() has completed, successfully or not. Guarded by lock.
    private boolean done = false;
    private String name;
    private static int number = 0;

    private final Object lock = new Object();

    public Worker(String url) {
        this.url = url;
        this.name = "Worker-" + (number++);
    }

    public String getName() {
        return name;
    }

    /**
     * Fetches the page and extracts its anchor elements. Always signals
     * completion — even on failure — so {@link #waitForResults()} cannot
     * block forever.
     */
    @Override
    public void run() {
        Elements links = null;
        try {
            Document doc = Jsoup.connect(this.url).get();
            links = doc.select("a");
        } catch (IOException e) {
            // You should implement a better error handling code..
            System.err.println("Error while parsing: "+this.url);
            e.printStackTrace();
        } finally {
            // BUG FIX: the original only notified on success, so an
            // IOException left waitForResults() waiting forever.
            synchronized (lock) {
                this.results = links; // stays null on failure
                this.done = true;
                lock.notifyAll();
            }
        }
    }

    /**
     * Blocks until {@link #run()} finishes and returns the scraped links,
     * or {@code null} if crawling failed or the wait was interrupted.
     */
    public Elements waitForResults() {
        synchronized (lock) {
            try {
                // Wait on the completion flag, not on results, so a failed
                // crawl (results == null) still terminates the wait.
                while (!this.done) {
                    lock.wait();
                }
                return this.results;
            } catch (InterruptedException e) {
                // Restore the interrupt status for callers further up.
                Thread.currentThread().interrupt();
                e.printStackTrace();
            }

            return null;
        }
    }
}
user2340612
  • 10,053
  • 4
  • 41
  • 66