I am trying to build a simple crawler with Jsoup. It finds all the links in a page's source code and follows them, searching each linked page for new links, and so on.

The problem is that the crawl becomes very slow once it goes more than two links deep.

This is the pseudocode of how it works:

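import java.io.IOException;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

void followLinks(List<String> links) throws IOException
{
    for (String link : links)
    {
        // connect() only builds the request; get() performs the blocking HTTP fetch
        Document doc = Jsoup.connect(link).get();
        // parse() is my own helper that extracts the links from a page
        List<String> newLinks = parse(doc);
        ...
    }
}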

My question is whether the code would be faster if I created a new Thread in each iteration of the loop, so that all connections are established in parallel. The connect call takes some time to return, so I suppose the requests queue up one after another. Can threading solve this?

John Smith

1 Answer

I don't think so. If your application goes through 1000 links, you will create 1000 threads, which will kill the performance of the whole application. Check this out to learn more: LINK ABOUT THREADS

It is better to keep a pool of threads and submit each 'search' to it as a task. CompletableFuture with an ExecutorService seems to be the solution.
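To make the pool idea concrete, here is a minimal sketch rather than a finished crawler. The names `PooledCrawler`, `crawl`, `fetchLinks`, `poolSize` and `maxDepth` are my own and purely illustrative. I am assuming the page-fetching step is passed in as a function; in your case it would wrap something like `Jsoup.connect(url).get()` and collect the `a[href]` elements. Here it is just an in-memory map, so the example runs without a network.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

class PooledCrawler {

    // Crawls level by level with a fixed-size thread pool.
    // fetchLinks is a hypothetical stand-in for "download the page and extract its links".
    // Returns every URL discovered, including URLs found on the last level
    // that were not themselves fetched.
    static Set<String> crawl(List<String> seeds,
                             Function<String, List<String>> fetchLinks,
                             int poolSize,
                             int maxDepth) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        Set<String> seen = ConcurrentHashMap.newKeySet();
        try {
            List<String> frontier = new ArrayList<>();
            for (String seed : seeds) {
                if (seen.add(seed)) {
                    frontier.add(seed);
                }
            }
            for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
                // One task per URL; at most poolSize of them run at the same time.
                List<CompletableFuture<List<String>>> futures = new ArrayList<>();
                for (String url : frontier) {
                    futures.add(CompletableFuture
                            .supplyAsync(() -> fetchLinks.apply(url), pool)
                            // A page that fails to load simply contributes no links.
                            .exceptionally(e -> Collections.<String>emptyList()));
                }
                // Wait for the whole level, keeping only links not seen before.
                List<String> next = new ArrayList<>();
                for (CompletableFuture<List<String>> future : futures) {
                    for (String link : future.join()) {
                        if (seen.add(link)) {
                            next.add(link);
                        }
                    }
                }
                frontier = next;
            }
        } finally {
            pool.shutdown();
        }
        return seen;
    }

    public static void main(String[] args) {
        // Fake "web" used instead of real HTTP requests.
        Map<String, List<String>> web = new ConcurrentHashMap<>();
        web.put("a", List.of("b", "c"));
        web.put("b", List.of("d"));
        web.put("c", List.of("d", "a"));
        Set<String> found = crawl(List.of("a"),
                url -> web.getOrDefault(url, List.of()), 4, 3);
        System.out.println(found);
    }
}
```

The point of the fixed pool is that 1000 links become 1000 queued tasks, not 1000 threads. The `seen` set also keeps the crawler from fetching the same page twice, which matters once pages start linking back to each other.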

Here is a question on Stack Overflow where someone tries to implement the same idea in their crawler: LINK

Emil Hotkowski