Given that an application has:
- the robots.txt contents,
- the URL of interest, and
- browsing-entity metadata (such as the user-agent string),
how can it check whether that particular URL is allowed by the robots.txt?
crawler-commons is a Java library that can parse robots.txt content for a given robot name and return the rules applicable to that robot. The resulting rules object has an isAllowed(String url) method which does what you are after.
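A minimal sketch of how that might look, assuming the robots.txt contents are already in memory and a hypothetical crawler name "my-crawler" (the exact parseContent signature can vary slightly between crawler-commons versions):

```java
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

import java.nio.charset.StandardCharsets;

public class RobotsCheck {
    public static void main(String[] args) {
        // robots.txt contents the application already holds (example data)
        String robotsTxt = "User-agent: *\nDisallow: /private/\n";

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();

        // Parse the rules that apply to the given robot name
        BaseRobotRules rules = parser.parseContent(
                "https://example.com/robots.txt",           // URL the robots.txt came from
                robotsTxt.getBytes(StandardCharsets.UTF_8), // raw robots.txt bytes
                "text/plain",                               // content type
                "my-crawler");                              // robot (user-agent) name

        // Check URLs of interest against the parsed rules
        System.out.println(rules.isAllowed("https://example.com/private/page.html")); // false
        System.out.println(rules.isAllowed("https://example.com/public/page.html"));  // true
    }
}
```

The parser only needs the raw bytes and the robot name, so you can reuse the same BaseRobotRules instance to check as many URLs from that host as you like.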