
I want to know how to parse a robots.txt file in Java.

Is there already any code?

djot
zahir hussain

3 Answers


Heritrix is an open-source web crawler written in Java. Looking through their javadoc, I see that they have a utility class Robotstxt for parsing the robots.txt file.
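Heritrix's exact API for this differs between versions, so rather than quote it, here is a self-contained sketch of what such a parser does: collect the `Disallow` rules that apply to a given user agent and test a path against them. The class and method names below are my own, not Heritrix's, and real parsers additionally handle `Allow` lines, wildcards, and `Crawl-delay`.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Minimal robots.txt check (illustrative only, not the Heritrix API).
public class RobotsCheck {

    public static boolean isAllowed(String robotsTxt, String userAgent, String path)
            throws Exception {
        List<String> disallows = new ArrayList<>();
        boolean applies = false; // does the current record apply to our user agent?
        try (BufferedReader r = new BufferedReader(new StringReader(robotsTxt))) {
            String line;
            while ((line = r.readLine()) != null) {
                // Strip comments and surrounding whitespace.
                int hash = line.indexOf('#');
                if (hash >= 0) line = line.substring(0, hash);
                line = line.trim();
                // Blank lines separate records.
                if (line.isEmpty()) { applies = false; continue; }
                int colon = line.indexOf(':');
                if (colon < 0) continue;
                String field = line.substring(0, colon).trim().toLowerCase();
                String value = line.substring(colon + 1).trim();
                if (field.equals("user-agent")) {
                    if (value.equals("*")
                            || userAgent.toLowerCase().contains(value.toLowerCase())) {
                        applies = true;
                    }
                } else if (field.equals("disallow") && applies && !value.isEmpty()) {
                    disallows.add(value);
                }
            }
        }
        // A path is disallowed if it starts with any collected rule prefix.
        for (String rule : disallows) {
            if (path.startsWith(rule)) return false;
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        String robots = "User-agent: *\nDisallow: /private/\n";
        System.out.println(isAllowed(robots, "MyBot", "/private/data.html")); // false
        System.out.println(isAllowed(robots, "MyBot", "/public/index.html")); // true
    }
}
```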

Bill the Lizard
  • There is a bug in Robotstxt; please do not use it. It wasted a lot of my time. For a file like this: `User-agent: *` / `Disallow: /`, the allowAll method of Robotstxt returns "true". – 10101010 Apr 27 '15 at 06:17

There's also the jrobotx library, hosted at SourceForge.

(Full disclosure: I spun off the code that forms that library.)

Alan Krueger

There is also a new release of crawler-commons:

https://github.com/crawler-commons/crawler-commons

The library aims to implement functionality common to any web crawler, and this includes a very handy robots.txt parser.
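A usage sketch of crawler-commons' `SimpleRobotRulesParser` might look like the following. This is an assumption based on the library's published API; the `parseContent` signature (in particular whether the robot names argument is a `String` or a collection) has changed across versions, so check the version you depend on.

```java
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class CrawlerCommonsDemo {
    public static void main(String[] args) {
        // The robots.txt body as fetched from the site.
        byte[] content = "User-agent: *\nDisallow: /private/\n".getBytes();

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        // Arguments: robots.txt URL, raw content, content type, our robot name(s).
        BaseRobotRules rules = parser.parseContent(
                "http://example.com/robots.txt", content, "text/plain", "mybot");

        // Query the parsed rules with full URLs.
        System.out.println(rules.isAllowed("http://example.com/private/page.html"));
        System.out.println(rules.isAllowed("http://example.com/index.html"));
    }
}
```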

Julien Nioche
anastluc