I am implementing a search engine in Java and using the Jsoup API to build the crawler component, and there are two things I still haven't understood. First: to fetch a web page, e.g. from Wikipedia, I call Jsoup.connect() like this:

private static final String agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) "
        + "AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";

Document htmlDocument = Jsoup.connect(url).userAgent(agent).get();

Some sites block certain crawlers' user agents, as specified in their robots.txt file. In that case, if I set the user agent of the connection to that of a web browser, the site allows access to all of its pages. I would like to know how this is possible. Does Jsoup fetch the robots.txt file and internally crawl the site according to the rules in it? Is this really implemented? And if so, how? What is the logic behind it?

The second thing is the DNS resolver. I read in the topic "java - Use IP and host with Jsoup" that when the system property sun.net.http.allowRestrictedHeaders is set to true, Jsoup can change the header of the GET request to use an IP address instead of the URL. Is that correct, or is the logic something else? And, as in my first question, what happens internally?

If anyone can answer at least one of these questions, I would appreciate it very much. Meanwhile, I'll study the Jsoup code on GitHub to see if I'm missing something.


1 Answer

Why don't you use a proper crawler like StormCrawler or Apache Nutch? They follow robots.txt, enforce politeness, etc. StormCrawler uses Jsoup for parsing HTML documents.
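
To illustrate what "following robots.txt" involves: as far as I know, Jsoup itself is only an HTTP client and HTML parser and does not read robots.txt, so the rules are advisory and the crawler has to check them before each request. That would also explain why switching to a browser user agent "works". Below is a minimal sketch of such a check, assuming simple prefix rules only (no wildcards, no Crawl-delay). The class name RobotsRules, its methods, and the crawler names are hypothetical, not part of Jsoup or any of the crawlers mentioned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Simplified robots.txt check (assumption: prefix rules only, no wildcards). */
final class RobotsRules {
    private final List<String> disallow = new ArrayList<>();
    private final List<String> allow = new ArrayList<>();

    /** Parses robots.txt text and keeps the rules that apply to the given crawler name. */
    static RobotsRules parse(String robotsTxt, String crawlerName) {
        RobotsRules specific = new RobotsRules(); // rules from groups naming this crawler
        RobotsRules wildcard = new RobotsRules(); // rules from "User-agent: *" groups
        boolean sawSpecific = false;
        String name = crawlerName.toLowerCase(Locale.ROOT);
        List<RobotsRules> targets = new ArrayList<>(); // rule sets fed by the current group
        boolean inAgentLines = false;

        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.replaceFirst("#.*", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                if (!inAgentLines) targets.clear(); // a new group starts here
                inAgentLines = true;
                String agent = value.toLowerCase(Locale.ROOT);
                if (agent.equals("*")) {
                    targets.add(wildcard);
                } else if (!agent.isEmpty() && name.contains(agent)) {
                    targets.add(specific);
                    sawSpecific = true;
                }
            } else {
                inAgentLines = false;
                if (value.isEmpty()) continue; // an empty Disallow forbids nothing
                for (RobotsRules t : targets) {
                    if (field.equals("disallow")) t.disallow.add(value);
                    else if (field.equals("allow")) t.allow.add(value);
                }
            }
        }
        // A group that names the crawler takes precedence over the "*" group.
        return sawSpecific ? specific : wildcard;
    }

    /** The longest matching prefix wins; on a tie, Allow wins. */
    boolean isAllowed(String path) {
        return longestMatch(allow, path) >= longestMatch(disallow, path);
    }

    private static int longestMatch(List<String> prefixes, String path) {
        int best = -1;
        for (String p : prefixes) {
            if (path.startsWith(p) && p.length() > best) best = p.length();
        }
        return best;
    }
}
```

In a Jsoup-based crawler, the robots.txt text could presumably be fetched once per host with something like Jsoup.connect(robotsUrl).ignoreContentType(true).execute().body(), then RobotsRules.parse(text, "MyCrawler").isAllowed(path) consulted before each Jsoup.connect(url).get(). Politeness (delays between requests to the same host) would still need to be handled separately.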

Julien Nioche
  • Actually, I didn't know about any of these crawlers. I was searching for something to help me crawl the web, and Jsoup was the first thing I found. I also know there are better ways to do this, such as not using Java, but due to lack of time I can't learn a new language for it. I'll take a look at them, though. Thanks a lot – Douglas Mendes Carmona May 16 '17 at 12:54