
Given that an application has:

  • the robots.txt contents,
  • the URL of interest, and
  • browsing-entity metadata (such as the user-agent string),

how can it check whether a particular URL is allowed by that robots.txt?

Denis Kulagin

1 Answer


crawler-commons is a Java library that can parse a robots.txt file for a particular robot name and return the rules applicable to that robot. The resulting rules object has an isAllowed(String url) method, which does exactly what you are after.
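
For example, a minimal sketch using SimpleRobotRulesParser (the example.com URLs and the "mybot" robot name are placeholders, and the exact parseContent signature varies a little between crawler-commons releases; newer versions accept a collection of robot names):

    import java.nio.charset.StandardCharsets;

    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    public class RobotsCheck {
        public static void main(String[] args) {
            // The robots.txt contents the application already has.
            String robotsTxt = "User-agent: *\nDisallow: /private/\n";

            // Parse the rules that apply to a given robot (user-agent) name.
            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
            BaseRobotRules rules = parser.parseContent(
                    "http://example.com/robots.txt",               // URL the robots.txt came from
                    robotsTxt.getBytes(StandardCharsets.UTF_8),    // raw robots.txt bytes
                    "text/plain",                                  // content type
                    "mybot");                                      // robot name from the user-agent

            // Check whether particular URLs are allowed.
            System.out.println(rules.isAllowed("http://example.com/private/page.html")); // false
            System.out.println(rules.isAllowed("http://example.com/public/page.html"));  // true
        }
    }
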

Julien Nioche