I'm very new to Java.
Now I want to retrieve the contents of the news articles found by a Google News search for the keyword "toy", from page 1 to page 10.
That is, retrieving 100 news articles from pages 1 through 10 (assuming 10 articles on every page).
After reading this: Crawler4j vs. Jsoup for the pages crawling and parsing in Java,
I decided to use Crawler4j, as it can:
Give base URI (home page)
Take all the URIs from each page and retrieve the contents of those too.
Move recursively for every URI you retrieve.
Retrieve the contents only of URIs that are inside this website (there could be external URIs referencing another website, we don't need those).
In my case, I can give the Google search pages from p1 to p10 as seeds, and it returns the 100 news articles if I set numberOfCrawlers = 1, as in the sketch below.
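For context, this is roughly how I set up the controller. It is only a sketch: NewsCrawler is my own crawler class, and the "start=" pagination parameter is my guess at how the search URL is built.

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class NewsController {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/data/crawl/root"); // any writable folder
        config.setMaxDepthOfCrawling(1); // follow result links one level deep

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);

        // One seed per result page. The "start=" parameter is my assumption
        // about the pagination format; also, Google's robots.txt may disallow
        // /search, in which case crawler4j would skip these seeds.
        for (int page = 0; page < 10; page++) {
            controller.addSeed("https://www.google.com/search?q=toy&tbm=nws&start=" + (page * 10));
        }

        controller.start(NewsCrawler.class, 1); // numberOfCrawlers = 1
    }
}
```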
However, when I try the Quickstart example of Crawler4j,
it only returns the links found from the original link, like these:
URL: http://www.ics.uci.edu/~lopes/
Text length: 2619
Html length: 11656
Number of outgoing links: 38
URL: http://www.ics.uci.edu/~welling/
Text length: 4503
Html length: 23713
Number of outgoing links: 24
URL: http://www.ics.uci.edu/~welling/teaching/courses.html
Text length: 2222
Html length: 15138
Number of outgoing links: 33
URL: http://www.ics.uci.edu/
Text length: 3661
Html length: 51628
Number of outgoing links: 86
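For reference, the MyCrawler class from the Quickstart looks roughly like this (paraphrased from the crawler4j README, so details may differ by version). As far as I can tell, the shouldVisit() filter is what confines the crawl to www.ics.uci.edu, and visit() is what prints the output above:

```java
import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpg|png|mp3|mp4|zip|gz))$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Only URLs under www.ics.uci.edu are followed; everything else is dropped.
        return !FILTERS.matcher(href).matches()
                && href.startsWith("https://www.ics.uci.edu/");
    }

    @Override
    public void visit(Page page) {
        // This is what produces the "URL / Text length / ..." lines above.
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            Set<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }
    }
}
```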
Hence, I wonder: can crawler4j perform the function I raised, or should I use crawler4j + Jsoup together?
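To make the question concrete, this is roughly what I imagine the Jsoup side would look like if I combined the two. The search URL format and the "h3 a" selector are only guesses on my part, not verified against Google's actual markup:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSketch {
    public static void main(String[] args) throws Exception {
        // Fetch one search result page (hypothetical URL format).
        Document resultPage = Jsoup
                .connect("https://www.google.com/search?q=toy&tbm=nws&start=0")
                .userAgent("Mozilla/5.0")
                .get();

        // "h3 a" is a guess at where the article links live on the page.
        for (Element link : resultPage.select("h3 a")) {
            String articleUrl = link.absUrl("href");
            // Follow each result link and grab the article's plain text.
            Document article = Jsoup.connect(articleUrl)
                    .userAgent("Mozilla/5.0")
                    .get();
            System.out.println(articleUrl + " -> "
                    + article.body().text().length() + " chars");
        }
    }
}
```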