
I wish to develop a web crawler in Groovy (using the Grails framework and a MongoDB database) that can crawl a website, producing a list of the site's URLs along with their resource types, their content, the response times, and the number of redirects involved.

I am debating between JSoup and Crawler4j. I have read about what they basically do, but I cannot clearly understand the difference between the two. Can anyone suggest which would be better for the above functionality? Or is it wrong to compare the two at all?
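For concreteness, the kind of record I have in mind might look something like this as a Grails domain class mapped to MongoDB (the class and property names are purely illustrative, not settled design):

```groovy
// Illustrative only: one crawled resource as a Grails/GORM domain class,
// persisted to MongoDB via the mongodb plugin's mapWith setting.
class CrawledResource {
    String url
    String resourceType      // e.g. "text/html", "image/png"
    String content           // raw body, if I choose to keep it
    Long responseTimeMs      // time taken to fetch this URL
    Integer redirectCount    // redirects followed before the final response

    static mapWith = "mongo" // store in MongoDB instead of a SQL datasource
}
```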

Thanks.

clever_bassi

1 Answer


Crawler4J is a crawler; Jsoup is a parser. Actually, you could (and probably should) use both. Crawler4J gives you an easy, multithreaded interface for fetching all the URLs and all the pages (content) of the site you want. After that you can use Jsoup to parse the data, with its amazing (jQuery-like) CSS selectors, and actually do something with it. Of course you have to consider dynamic (JavaScript-generated) content. If you want that content too, then you have to use something that includes a JavaScript engine (a headless browser plus a parser), such as HtmlUnit or WebDriver (Selenium), which will execute the JavaScript before parsing the content.
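A minimal sketch of that division of labour: crawler4j drives the crawl and hands each fetched page to Jsoup for parsing. It assumes the crawler4j 4.x API, where `shouldVisit` takes two arguments (older versions use a one-argument signature), and the seed URL, storage folder, depth limit and thread count are all placeholders:

```groovy
import edu.uci.ics.crawler4j.crawler.CrawlConfig
import edu.uci.ics.crawler4j.crawler.CrawlController
import edu.uci.ics.crawler4j.crawler.Page
import edu.uci.ics.crawler4j.crawler.WebCrawler
import edu.uci.ics.crawler4j.fetcher.PageFetcher
import edu.uci.ics.crawler4j.parser.HtmlParseData
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer
import edu.uci.ics.crawler4j.url.WebURL
import org.jsoup.Jsoup

class SiteCrawler extends WebCrawler {

    // Crawler4J decides which links to follow; here: stay on one domain.
    @Override
    boolean shouldVisit(Page referringPage, WebURL url) {
        url.getURL().toLowerCase().startsWith('http://www.example.com/')
    }

    // Crawler4J hands over each fetched page; Jsoup takes it from there.
    @Override
    void visit(Page page) {
        if (page.parseData instanceof HtmlParseData) {
            def doc = Jsoup.parse((page.parseData as HtmlParseData).html)
            // jQuery-like CSS selectors do the actual data extraction
            println "${page.webURL.URL} | title: ${doc.title()} | links: ${doc.select('a[href]').size()}"
        }
    }
}

// Controller boilerplate: storage folder, depth limit, seed URL, thread count
def config = new CrawlConfig(crawlStorageFolder: '/tmp/crawl', maxDepthOfCrawling: 5)
def fetcher = new PageFetcher(config)
def robots = new RobotstxtServer(new RobotstxtConfig(), fetcher)
def controller = new CrawlController(config, fetcher, robots)
controller.addSeed('http://www.example.com/')
controller.start(SiteCrawler, 4)   // 4 crawler threads
```

The `Page` object also carries metadata such as the content type, which helps with the question's "resource types" requirement; recording response times and redirect counts is bookkeeping you would layer on yourself.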

Alkis Kalogeris
  • I thought exactly the same. I would actually need both a crawler and a parser. The crawler could be crawler4j, but about the parser I'm dubious. JSoup is a lot "Groovier" than other parsers. HtmlUnit fails in several cases that have "anything beyond trivial" JavaScript; also, from user reviews, it's apparent that it works on fewer than 50% of websites. – clever_bassi Jun 24 '14 at 13:31
  • Maybe WebDriver, then. I haven't used it, but I've heard excellent things. – Alkis Kalogeris Jun 24 '14 at 14:41
  • I have been looking into integrating Selenium WebDriver with JSoup. Thanks for the suggestion. – clever_bassi Jun 24 '14 at 15:37
  • I just tried `Jsoup` and noticed that it can also be used as a crawler, i.e. to retrieve web-page content. Could you please clarify the difference between `Crawler4J` and `Jsoup`? – Mike Jan 19 '16 at 22:17
  • It can, but it's missing a lot of functionality that a crawler should have. For example, say I have the base URI of a site. 1) How will I retrieve all the contents of that site with Jsoup? 2) Some link paths are circular; how can I avoid fetching the same URL twice? 3) I want it multithreaded. 4) The site is vast; I want it to go only 50 links deep. All of this can be done with Jsoup, but you have to implement it yourself. Crawler4J, like any crawler microframework, has this functionality built in. – Alkis Kalogeris Jan 20 '16 at 05:06
  • Jsoup has a very simple interface for getting the content (see the sketch below). That gives you a way of implementing a crawler, but it doesn't make Jsoup a crawler. Its main purpose is what comes after the crawling, namely the parsing. – Alkis Kalogeris Jan 20 '16 at 05:11
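To make the last two comments concrete, here is a toy, single-threaded crawl loop built on nothing but Jsoup. It is a sketch, not production code: the seed URL and depth limit are placeholders, and the queue, visited set and depth check are exactly the bookkeeping a framework like Crawler4J would otherwise provide for you.

```groovy
import org.jsoup.Jsoup

def seed = 'https://example.com/'            // placeholder seed URL
def maxDepth = 2                             // placeholder depth limit
def visited = [] as Set
def queue = [[seed, 0]] as LinkedList        // hand-rolled frontier of (url, depth) pairs

while (!queue.isEmpty()) {
    def (url, depth) = queue.poll()
    if (url in visited || depth > maxDepth) continue   // cycle and depth guards: your job, not Jsoup's
    visited << url
    def doc
    try {
        doc = Jsoup.connect(url).get()       // fetch + parse in one call
    } catch (Exception ignored) {
        continue                             // skip unreachable or non-HTML URLs
    }
    println "$url -> ${doc.title()}"
    doc.select('a[href]').each { link ->
        def next = link.attr('abs:href')     // resolve to an absolute URL
        if (next) queue << [next, depth + 1]
    }
}
```

Everything outside the `Jsoup.connect()` line is crawler plumbing written by hand; that is the part Crawler4J gives you for free, along with politeness, robots.txt handling and multithreading.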