Nutch is a mature, production-ready web crawler. It enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.
Nutch is open-source web-search software. It builds on Hadoop, adding web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats.
Nutch can run on a single machine, but it gains much of its strength from running on a Hadoop cluster.
The system can be extended (e.g. to parse other document formats or to extract custom information) using a plugin mechanism.
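To illustrate the general idea behind a plugin mechanism, here is a minimal conceptual sketch in Python. It is not Nutch's actual plugin API: the names `PARSERS`, `register_parser`, and `parse` are hypothetical and chosen only for illustration. The sketch assumes that a plugin is a function registered for one document format, and that the core looks up the right function by MIME type.

```python
from typing import Callable, Dict

# Hypothetical registry (not part of Nutch): maps a MIME type to a parser function.
PARSERS: Dict[str, Callable[[bytes], str]] = {}


def register_parser(mime_type: str):
    """Decorator that registers a parser function for one MIME type."""
    def decorator(func: Callable[[bytes], str]) -> Callable[[bytes], str]:
        PARSERS[mime_type] = func
        return func
    return decorator


@register_parser("text/plain")
def parse_plain(content: bytes) -> str:
    # Plain text needs no extraction beyond decoding.
    return content.decode("utf-8", errors="replace")


@register_parser("text/csv")
def parse_csv(content: bytes) -> str:
    # Flatten all cells into one space-separated string.
    text = content.decode("utf-8", errors="replace")
    cells = (cell.strip() for line in text.splitlines() for cell in line.split(","))
    return " ".join(cells)


def parse(mime_type: str, content: bytes) -> str:
    """Dispatch to whichever parser was registered for this MIME type."""
    try:
        parser = PARSERS[mime_type]
    except KeyError:
        raise ValueError(f"no parser registered for {mime_type}")
    return parser(content)
```

In this sketch, supporting a new document format means registering one more function. The dispatching code in `parse` stays unchanged, which is the property a plugin mechanism is meant to provide.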
For more information about Nutch, please see the Nutch wiki.
Nutch has a mailing list where users can post questions and developers can respond. Sometimes it is faster to get a reply there.