
Given that an application has:

  • the robots.txt contents,
  • the URL of interest, and
  • browsing-entity metadata (such as the user-agent string),

how can it check whether a particular URL is allowed by that robots.txt?

Denis Kulagin

1 Answer


crawler-commons is a Java library that can parse a robots.txt file for a particular robot name and return the rules applicable to that robot. The resulting rules object has an isAllowed(String url) method, which does exactly what you are after.
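
For example, a minimal sketch using SimpleRobotRulesParser (the example.com URLs and the "mybot" robot name are placeholders, and the exact parseContent signature varies a little between crawler-commons releases; newer versions accept a collection of robot names):

    import java.nio.charset.StandardCharsets;

    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    public class RobotsCheck {
        public static void main(String[] args) {
            // The robots.txt contents the application already has.
            String robotsTxt = "User-agent: *\nDisallow: /private/\n";

            // Parse the rules that apply to a given robot (user-agent) name.
            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
            BaseRobotRules rules = parser.parseContent(
                    "http://example.com/robots.txt",               // URL the robots.txt came from
                    robotsTxt.getBytes(StandardCharsets.UTF_8),    // raw robots.txt bytes
                    "text/plain",                                  // content type
                    "mybot");                                      // robot name from the user-agent

            // Check whether particular URLs are allowed.
            System.out.println(rules.isAllowed("http://example.com/private/page.html")); // false
            System.out.println(rules.isAllowed("http://example.com/public/page.html"));  // true
        }
    }
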

Julien Nioche