
Here is the original web crawler I wrote (just for reference):

https://github.com/domshahbazi/java-webcrawler/tree/master

This is a simple web crawler which visits a given initial web page, scrapes all the links from the page, and adds them to a queue (LinkedList); the links are then popped off one by one and visited, and the cycle starts again. To speed up my program, and for learning, I tried to implement it using threads, so I could have many threads operating at once, indexing more pages in less time. Below is each class:

Main class

public class controller {

    public static void main(String[] args) throws InterruptedException {

        DataStruc data = new DataStruc("http://www.imdb.com/title/tt1045772/?ref_=nm_flmg_act_12");

        Thread crawl1 = new Crawler(data);
        Thread crawl2 = new Crawler(data);

        crawl1.start();
        crawl2.start();
   }    
}

Crawler Class (Thread)

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Crawler extends Thread {

    /** Instance of Data Structure **/
    DataStruc data;

    /** Number of page connections allowed before program terminates **/
    private final int INDEX_LIMIT = 10;

    /** Constructor, takes the shared data structure **/
    public Crawler(DataStruc d) {
        data = d;
    }

    public void run() {

        // Counter to keep track of number of indexed URLS
        int counter = 0;

        // While URLs are left to visit
        while((data.url_to_visit_size() > 0) && counter<INDEX_LIMIT) {

            // Poll the next URL to visit from the queue
            String currentUrl = data.getURL();

            try {
                // Fetch and parse HTML document
                Document doc = Jsoup.connect(currentUrl)                 
                        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36")
                        .referrer("http://www.google.com")
                        .timeout(12000)
                        .followRedirects(true)
                        .get();

                // Increment counter if connection to web page succeeds
                counter++;

                /** .select returns a list of elements (links in this case) **/
                Elements links = doc.select("a[href]"); // Select all links on the page

                // Add newly found links to the queue
                addLinksToQueue(links);                             

            } catch (IOException e) {
                //e.printStackTrace();
                System.out.println("Error: "+currentUrl);
            }               
        }       
    }

    public void addLinksToQueue(Elements el) {
        // For each element in links
        for(Element e : el) {           

            String theLink = e.attr("abs:href"); // 'abs' prefix ensures an absolute URL is returned rather than a relative one ('www.reddit.com/hello' rather than '/hello')

            if(theLink.startsWith("http") && !data.oldLink(theLink)) {
                data.addURL(theLink);
                data.addVisitedURL(theLink); // Register each unique URL to ensure it isn't stored in 'url_to_visit' again
                System.out.println(theLink);
            }               
        }   
    }
}

DataStruc Class

import java.util.ArrayList;
import java.util.concurrent.ConcurrentLinkedQueue;

public class DataStruc {

    /** Queue to store URL's, can be accessed by multiple threads **/
    private ConcurrentLinkedQueue<String> url_to_visit = new ConcurrentLinkedQueue<String>();

    /** ArrayList of visited URL's **/
    private ArrayList<String> visited_url = new ArrayList<String>();

    public DataStruc(String initial_url) {
        url_to_visit.offer(initial_url);
    }

    // Add a URL to the queue
    public void addURL(String url) {
        url_to_visit.offer(url);
    }

    // Get URL at front of queue
    public String getURL() {
        return url_to_visit.poll();
    }

    // Number of URLs left to visit
    public int url_to_visit_size() {
        return url_to_visit.size();
    }

    // Add visited URL
    public void addVisitedURL(String url) {
        visited_url.add(url);
    }

    // Checks if link has already been visited
    public boolean oldLink(String link) {
        for(String s : visited_url) {
            if(s.equals(link)) {
                return true;
            }
        }   
        return false;
    }       
}

DataStruc is the shared data structure class, which is accessed concurrently by each instance of a Crawler thread. DataStruc has a queue to store the links to be visited, and an ArrayList to store visited URLs, to prevent entering a loop. I used a ConcurrentLinkedQueue to store the URLs to be visited, as I see it takes care of concurrent access. I didn't think I required concurrent access for my ArrayList of visited URLs, as all I need to be able to do is add to it and iterate over it to check for matches.

My problem is that when I compare the running time of a single thread vs. two threads (on the same URL), my single-threaded version seems to be faster. I feel I have implemented the threading incorrectly, and would like some tips if anybody can pinpoint the issues.
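For reference, a minimal sketch of how I time a run: the same main as above, but with join calls added so that main waits for both crawlers before stopping the clock (the joins are not in my original main):

long start = System.nanoTime();

Thread crawl1 = new Crawler(data);
Thread crawl2 = new Crawler(data);
crawl1.start();
crawl2.start();

// Wait for both crawlers to finish before stopping the clock
crawl1.join();
crawl2.join();

long elapsedMs = (System.nanoTime() - start) / 1_000_000;
System.out.println("Crawl took " + elapsedMs + " ms");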

Thanks!

Dom Shahbazi
  • Provide the current implementation you're using for multithreading in the form of an [SSCCE](http://sscce.org), otherwise we can't help you. Providing your code through external links is not acceptable on this site. – Luiggi Mendoza Oct 14 '14 at 14:44
  • This question would be very odd these days, but how many cores does your CPU have? – Luiggi Mendoza Oct 14 '14 at 15:32
  • My laptop is dual core – Dom Shahbazi Oct 14 '14 at 15:38
  • This may be down to your connection to the internet. If you are network-bound, then having several threads trying to use the connection may well slow things down. – Jaydee Oct 14 '14 at 15:40
  • Well, since it has at least 2 cores, each thread can run on its own core (assuming your OS handles multithreading well). Your problem may then lie in the I/O bound to your network access, as @Jaydee points out. – Luiggi Mendoza Oct 14 '14 at 15:46
  • The issue may be in the test `while((data.url_to_visit_size() > 0)`. Won't the 2nd thread fail there immediately, since the 1st thread just polled? – user949300 Oct 14 '14 at 15:54
  • Your assumption that DataStruc.visited_url doesn't need to handle concurrency is incorrect, IMHO. What would happen if one thread calls oldLink and the other one calls addVisitedURL? – aliher Oct 14 '14 at 16:10
  • Would it be better to create a master thread which visits websites and also spawns off other threads to visit websites? When there are, say, 10 websites left in the list, I could terminate all the threads apart from the master. – Dom Shahbazi Oct 22 '14 at 13:39

2 Answers


Added: see my comment above; I think the check in Crawler

// While URLs are left to visit
while ((data.url_to_visit_size() > 0) && counter < INDEX_LIMIT) {

is wrong. The 2nd thread will stop immediately, since the 1st thread has just polled the only URL.
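If you keep the shared-queue design, a sketch of one way around that (untested, and the 5-second timeout is arbitrary): have getURL block briefly for new work instead of the loop bailing out the moment the queue is empty, e.g. with a LinkedBlockingQueue in DataStruc:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// In DataStruc, replacing the ConcurrentLinkedQueue:
private final BlockingQueue<String> url_to_visit = new LinkedBlockingQueue<String>();

// Wait up to 5 seconds for a URL before returning null, so a thread
// doesn't give up just because another thread momentarily emptied the queue
public String getURL() throws InterruptedException {
    return url_to_visit.poll(5, TimeUnit.SECONDS);
}

The crawler loop would then treat a null return as "no more work" instead of checking url_to_visit_size() up front.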

You can ignore the rest, but I've left it for history...

My general approach to these kinds of "big blocks that can run in parallel" problems is:

  1. Make each crawler a Callable. Probably Callable<List<String>>
  2. Submit them to an ExecutorService
  3. When they complete, take the results one at a time and add them to a List (a rough sketch follows).
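A rough sketch of that shape (FetchTask is my name, untested; exception handling omitted):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// One task per page: fetch a single URL and return the links found on it
public class FetchTask implements Callable<List<String>> {

    private final String url;

    public FetchTask(String url) {
        this.url = url;
    }

    @Override
    public List<String> call() throws Exception {
        List<String> links = new ArrayList<String>();
        for (Element e : Jsoup.connect(url).get().select("a[href]")) {
            links.add(e.attr("abs:href"));
        }
        return links;
    }
}

Submitting and collecting would then look something like this, where batch and foundLinks are whatever lists you read from and collect into:

ExecutorService es = Executors.newFixedThreadPool(4);
List<Future<List<String>>> futures = new ArrayList<Future<List<String>>>();
for (String url : batch) {
    futures.add(es.submit(new FetchTask(url)));
}
for (Future<List<String>> f : futures) {
    foundLinks.addAll(f.get()); // blocks until that task completes
}
es.shutdown();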

Using this strategy there is no need to use any concurrent lists at all. The disadvantage is that you don't get much live feedback as they are running. And, if what they return is huge, you may need to worry about memory.

Would this suit your needs? You would still have to worry about addVisitedURL, so you would still need that to be a concurrent data structure.
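For example (a sketch; markVisited is my name): a thread-safe set built on ConcurrentHashMap gives you an atomic check-and-add, which also replaces the linear oldLink scan:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Thread-safe set of visited URLs; add() returns false if the URL was
// already present, so check-and-add becomes a single atomic operation.
// (newKeySet() needs Java 8; on older Java, use
// Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>()).)
private final Set<String> visitedUrls = ConcurrentHashMap.newKeySet();

public boolean markVisited(String url) {
    return visitedUrls.add(url);
}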

Added: since you are starting with a single URL, this strategy doesn't apply right away. You could apply it after the visit to the first URL.

user949300
  • *Using this strategy there is no need to use any concurrent lists at all. The disadvantage is that you don't get much live feedback as they are running*. The first sentence is wrong, since you can use concurrent lists to feed data to the tasks (in the form of `Runnable` or `Callable`). The second sentence is also wrong: you just have to define another concurrent structure, shared by all tasks, where each task can store messages, signals, or other kinds of info. Also, if you look at OP's implementation, you won't gain much speed by changing to a thread-pool strategy. – Luiggi Mendoza Oct 14 '14 at 15:38
  • Each Callable has a single URL to process, so the 1st sentence is correct. As for the 2nd sentence, yes, you could define a concurrent structure for messages, but if you don't, life is simpler, with less live feedback. However, since OP's crawlers do interact with each other somewhat, this approach may be less optimal. – user949300 Oct 14 '14 at 15:43
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class controller {

    public static void main(String[] args) throws InterruptedException {

        final int LIMIT = 4;

        List<String> seedList = new ArrayList<>(); //1
        seedList.add("https://www.youtube.com/");
        seedList.add("https://www.digg.com/");
        seedList.add("https://www.reddit.com/");
        seedList.add("https://www.nytimes.com/");

        DataStruc[] data = new DataStruc[LIMIT];

        for(int i = 0; i < LIMIT; i++){
            data[i] = new DataStruc(seedList.get(i)); //2
        }

        ExecutorService es = Executors.newFixedThreadPool(LIMIT);
        Crawler[] crawl = new Crawler[LIMIT];
        for(int i = 0; i < LIMIT; i++){
            crawl[i] = new Crawler(data[i]); //3
        }

        for(int i = 0; i < LIMIT; i++){
            es.submit(crawl[i]); // 4
        }
    }
}

You can try this out:

  1. Create a seed list.
  2. Create DataStruc objects and add one seed URL to each of them.
  3. Create the Crawler array and pass a DataStruc object to each one.
  4. Pass each crawl object to the executor (see the shutdown note below).
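One thing to note (my addition, not in the snippet above): the pool threads are non-daemon, so the JVM won't exit on its own once the crawlers finish. A sketch of the shutdown step you would add after the submit loop (needs java.util.concurrent.TimeUnit; the 10-minute budget is arbitrary):

// Stop accepting new tasks, then wait (up to a bound) for the running ones
es.shutdown();
if (!es.awaitTermination(10, TimeUnit.MINUTES)) {
    es.shutdownNow(); // force-stop any crawlers that overran the budget
}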
Ameya