Here is the original web crawler I wrote (just for reference):
https://github.com/domshahbazi/java-webcrawler/tree/master
This is a simple web crawler which visits a given initial web page, scrapes all the links from the page and adds them to a queue (LinkedList), where they are then popped off one by one and each visited, at which point the cycle starts again. To speed up my program, and for learning, I tried to implement it using threads, so I could have many threads operating at once and index more pages in less time. Below is each class:
Main class
public class controller {

    public static void main(String[] args) throws InterruptedException {
        DataStruc data = new DataStruc("http://www.imdb.com/title/tt1045772/?ref_=nm_flmg_act_12");
        Thread crawl1 = new Crawler(data);
        Thread crawl2 = new Crawler(data);
        crawl1.start();
        crawl2.start();
    }
}
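For reference, a minimal way to time a run from main, which is how I compare the versions mentioned at the end of this post. The join() calls and the System.nanoTime() measurement are benchmarking additions here, and TimedController is just an illustrative name, not part of the crawler itself:

public class TimedController {

    public static void main(String[] args) throws InterruptedException {
        DataStruc data = new DataStruc("http://www.imdb.com/title/tt1045772/?ref_=nm_flmg_act_12");
        Thread crawl1 = new Crawler(data);
        Thread crawl2 = new Crawler(data);

        long start = System.nanoTime();
        crawl1.start();
        crawl2.start();
        // Block until both crawlers have finished before stopping the clock
        crawl1.join();
        crawl2.join();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("Crawl finished in " + elapsedMs + " ms");
    }
}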
Crawler Class (Thread)
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Crawler extends Thread {

    /** Instance of the shared data structure **/
    DataStruc data;

    /** Number of page connections allowed before the thread terminates **/
    private final int INDEX_LIMIT = 10;

    /** Construct a crawler sharing the given data structure **/
    public Crawler(DataStruc d) {
        data = d;
    }

    public void run() {
        // Counter to keep track of the number of indexed URLs
        int counter = 0;
        // While there are URLs left to visit and the limit has not been reached
        while ((data.url_to_visit_size() > 0) && counter < INDEX_LIMIT) {
            // Poll the next URL to visit from the queue
            String currentUrl = data.getURL();
            try {
                // Fetch and parse the HTML document
                Document doc = Jsoup.connect(currentUrl)
                        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36")
                        .referrer("http://www.google.com")
                        .timeout(12000)
                        .followRedirects(true)
                        .get();
                // Increment the counter only if the connection to the web page succeeds
                counter++;
                // .select returns a list of matching elements (anchor tags with an href attribute)
                Elements links = doc.select("a[href]");
                // Add newly found links to the queue
                addLinksToQueue(links);
            } catch (IOException e) {
                System.out.println("Error: " + currentUrl);
            }
        }
    }

    public void addLinksToQueue(Elements el) {
        // For each link element found on the page
        for (Element e : el) {
            // The 'abs:' prefix resolves the href to an absolute URL ('http://www.reddit.com/hello' rather than '/hello')
            String theLink = e.attr("abs:href");
            if (theLink.startsWith("http") && !data.oldLink(theLink)) {
                data.addURL(theLink);
                data.addVisitedURL(theLink); // Register each unique URL so it isn't stored in 'url_to_visit' again
                System.out.println(theLink);
            }
        }
    }
}
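As an aside, I understand the same two-worker fan-out can be written with a thread pool from java.util.concurrent instead of raw threads. This is only a sketch of that alternative (PooledController is an illustrative name; it works because Crawler is a Runnable):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PooledController {

    public static void main(String[] args) throws InterruptedException {
        DataStruc data = new DataStruc("http://www.imdb.com/title/tt1045772/?ref_=nm_flmg_act_12");
        // Fixed pool of two workers, mirroring the two raw threads above
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(new Crawler(data));
        pool.submit(new Crawler(data));
        // No further tasks will be submitted; wait for both crawlers to finish
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }
}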
DataStruc Class
import java.util.ArrayList;
import java.util.concurrent.ConcurrentLinkedQueue;

public class DataStruc {

    /** Queue of URLs still to visit; safe for access by multiple threads **/
    private ConcurrentLinkedQueue<String> url_to_visit = new ConcurrentLinkedQueue<String>();

    /** ArrayList of visited URLs **/
    private ArrayList<String> visited_url = new ArrayList<String>();

    public DataStruc(String initial_url) {
        url_to_visit.offer(initial_url);
    }

    // Add a URL to the queue of URLs to visit
    public void addURL(String url) {
        url_to_visit.offer(url);
    }

    // Get the URL at the front of the queue (null if the queue is empty)
    public String getURL() {
        return url_to_visit.poll();
    }

    // Number of URLs still to visit
    public int url_to_visit_size() {
        return url_to_visit.size();
    }

    // Record a URL as visited
    public void addVisitedURL(String url) {
        visited_url.add(url);
    }

    // Check whether a link has already been visited
    public boolean oldLink(String link) {
        for (String s : visited_url) {
            if (s.equals(link)) {
                return true;
            }
        }
        return false;
    }
}
DataStruc is the shared data structure class, which is accessed concurrently by each instance of a Crawler thread. DataStruc has a queue to store the links to be visited, and an ArrayList to store visited URLs, to prevent the crawler entering a loop. I used a ConcurrentLinkedQueue to store the URLs to be visited, as I understand it takes care of concurrent access. I didn't require concurrent access for my ArrayList of visited URLs, as all I need to be able to do is add to it and iterate over it to check for matches.
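For comparison, if the visited collection did need to be safe for concurrent access, I understand it could be replaced with a concurrent set, which would also fold the "seen before?" check and the insert into one atomic step. This is only a sketch (ConcurrentDataStruc and addIfNew are names I made up for illustration, assuming ConcurrentHashMap.newKeySet() from Java 8):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class ConcurrentDataStruc {

    private ConcurrentLinkedQueue<String> url_to_visit = new ConcurrentLinkedQueue<String>();
    // Concurrent set: add() returns false if the URL was already present
    private Set<String> visited_url = ConcurrentHashMap.newKeySet();

    public ConcurrentDataStruc(String initial_url) {
        visited_url.add(initial_url);
        url_to_visit.offer(initial_url);
    }

    // Enqueue the URL only if it has never been seen before; the add()
    // call makes the check-then-insert a single atomic operation
    public boolean addIfNew(String url) {
        if (visited_url.add(url)) {
            url_to_visit.offer(url);
            return true;
        }
        return false;
    }

    public String getURL() {
        return url_to_visit.poll();
    }

    public int url_to_visit_size() {
        return url_to_visit.size();
    }
}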
My problem is that when I compare the running time of the single-threaded version against the two-threaded version (on the same seed URL, timed as in the sketch under the main class), the single-threaded version seems to work faster. I feel I have implemented the threading incorrectly, and would like some tips if anybody can pinpoint the issues?
Thanks!