I'm very new to Java.
Now I want to retrieve the contents of the news articles found by a Google News search for the keyword "toy", from page 1 to page 10.
That is, retrieving 100 news articles from pages 1 through 10 (assuming 10 articles on every page).
After reading this: Crawler4j vs. Jsoup for the pages crawling and parsing in Java,
I decided to use Crawler4j, as it can:
Give base URI (home page)
Take all the URIs from each page and retrieve the contents of those too.
Move recursively for every URI you retrieve.
Retrieve the contents only of URIs that are inside this website (there could be external URIs referencing another website, we don't need those).
In my case, I can give the Google search pages from p1 to p10 as seeds, and it returns the 100 news articles if I set numberOfCrawlers = 1, as in the sketch below.
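For context, this is roughly how I set up the controller. It is only a sketch: NewsCrawler is my own crawler class, and the "start=" pagination parameter is my guess at how the search URL is built.

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class NewsController {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/data/crawl/root"); // any writable folder
        config.setMaxDepthOfCrawling(1); // follow result links one level deep

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);

        // One seed per result page. The "start=" parameter is my assumption
        // about the pagination format; also, Google's robots.txt may disallow
        // /search, in which case crawler4j would skip these seeds.
        for (int page = 0; page < 10; page++) {
            controller.addSeed("https://www.google.com/search?q=toy&tbm=nws&start=" + (page * 10));
        }

        controller.start(NewsCrawler.class, 1); // numberOfCrawlers = 1
    }
}
```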
However, when I try the Quickstart example of Crawler4j,
it only returns the links found from the original link, like these:
URL: http://www.ics.uci.edu/~lopes/
Text length: 2619
Html length: 11656
Number of outgoing links: 38
URL: http://www.ics.uci.edu/~welling/
Text length: 4503
Html length: 23713
Number of outgoing links: 24
URL: http://www.ics.uci.edu/~welling/teaching/courses.html
Text length: 2222
Html length: 15138
Number of outgoing links: 33
URL: http://www.ics.uci.edu/
Text length: 3661
Html length: 51628
Number of outgoing links: 86
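For reference, the MyCrawler class from the Quickstart looks roughly like this (paraphrased from the crawler4j README, so details may differ by version). As far as I can tell, the shouldVisit() filter is what confines the crawl to www.ics.uci.edu, and visit() is what prints the output above:

```java
import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpg|png|mp3|mp4|zip|gz))$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Only URLs under www.ics.uci.edu are followed; everything else is dropped.
        return !FILTERS.matcher(href).matches()
                && href.startsWith("https://www.ics.uci.edu/");
    }

    @Override
    public void visit(Page page) {
        // This is what produces the "URL / Text length / ..." lines above.
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            Set<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }
    }
}
```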
Hence, I wonder: can crawler4j perform the function I raised, or should I use crawler4j + Jsoup together?
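To make the question concrete, this is roughly what I imagine the Jsoup side would look like if I combined the two. The search URL format and the "h3 a" selector are only guesses on my part, not verified against Google's actual markup:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSketch {
    public static void main(String[] args) throws Exception {
        // Fetch one search result page (hypothetical URL format).
        Document resultPage = Jsoup
                .connect("https://www.google.com/search?q=toy&tbm=nws&start=0")
                .userAgent("Mozilla/5.0")
                .get();

        // "h3 a" is a guess at where the article links live on the page.
        for (Element link : resultPage.select("h3 a")) {
            String articleUrl = link.absUrl("href");
            // Follow each result link and grab the article's plain text.
            Document article = Jsoup.connect(articleUrl)
                    .userAgent("Mozilla/5.0")
                    .get();
            System.out.println(articleUrl + " -> "
                    + article.body().text().length() + " chars");
        }
    }
}
```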