I am new to Java and I want to make a simple web crawler. How do I access a website's robots.txt file in Java? Actually, I don't know much about robots.txt. Please help me out.
- The robots.txt file is in a standard location on every website (since search engines need to be able to find it). Accessing it is as simple as performing a GET of [url]/robots.txt ;) – Mike McMahon Apr 10 '12 at 23:45
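For example, a minimal sketch that derives the robots.txt location from a page URL with java.net.URI (the example.com URL is just a placeholder):

```java
import java.net.URI;

public class RobotsTxtLocation {
    public static void main(String[] args) {
        // Any page on the site you want to crawl (hypothetical example)
        URI page = URI.create("https://example.com/some/page.html");
        // robots.txt always lives at the root of the host
        URI robots = page.resolve("/robots.txt");
        System.out.println(robots); // prints https://example.com/robots.txt
    }
}
```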
1 Answer
You need to solve two tasks (a combined sketch follows this list):
- use an HTTP library to fetch files over HTTP -- How to send HTTP request in java?
- write or use a parser for robots.txt files -- robots.txt parser java
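A minimal sketch combining the two, assuming java.net.HttpURLConnection for the fetch and a naive line-by-line scan of Disallow rules (example.com is a placeholder host; a real crawler should use a proper robots.txt parser library):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SimpleRobotsFetcher {
    public static void main(String[] args) throws Exception {
        // Fetch robots.txt from the site root (placeholder URL)
        URL url = new URL("https://example.com/robots.txt");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        List<String> disallowed = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            boolean appliesToUs = false;
            while ((line = in.readLine()) != null) {
                // Strip comments and surrounding whitespace
                int hash = line.indexOf('#');
                if (hash >= 0) line = line.substring(0, hash);
                line = line.trim();
                if (line.isEmpty()) continue;

                // Very naive parsing: only honour the wildcard user-agent group
                if (line.toLowerCase().startsWith("user-agent:")) {
                    appliesToUs = line.substring("user-agent:".length()).trim().equals("*");
                } else if (appliesToUs && line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring("disallow:".length()).trim();
                    if (!path.isEmpty()) disallowed.add(path);
                }
            }
        }
        System.out.println("Disallowed path prefixes: " + disallowed);
    }
}
```

Before crawling a URL, you would then check whether its path starts with any of the collected prefixes and skip it if so.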