I need to supply a base URL (such as http://www.wired.com) and spider through the entire site, outputting an array of the pages found off that base URL. Is there any library that would do the trick?
Thanks.
I have used Web Harvest a couple of times, and it is quite good for web scraping.
Web-Harvest is an open source web data extraction tool written in Java. It offers a way to collect desired web pages and extract useful data from them. In order to do that, it leverages well-established techniques and technologies for text/XML manipulation such as XSLT, XQuery and regular expressions. Web-Harvest mainly focuses on HTML/XML-based web sites, which still make up the vast majority of web content. On the other hand, it can easily be supplemented with custom Java libraries to augment its extraction capabilities.
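For reference, here is a minimal sketch of driving Web-Harvest from Java, assuming the ScraperConfiguration/Scraper API and a crawl pipeline saved as wired-crawl.xml (the file name and working directory are illustrative, not part of the library):

    import org.webharvest.definition.ScraperConfiguration;
    import org.webharvest.runtime.Scraper;

    public class WiredCrawl {
        public static void main(String[] args) throws Exception {
            // Load an XML pipeline definition that fetches pages and extracts links.
            // "wired-crawl.xml" is a placeholder for your own configuration file.
            ScraperConfiguration config = new ScraperConfiguration("wired-crawl.xml");

            // The second argument is the working directory for downloaded content.
            Scraper scraper = new Scraper(config, "work");
            scraper.execute();
        }
    }

The actual crawling logic (which pages to fetch, which links to follow) lives in the XML configuration; the Java side just loads and runs it.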
Alternatively, you can roll your own web scraper using tools such as JTidy to first convert an HTML document to XHTML, and then process the information you need with XPath. For example, a very naïve XPath expression to extract all hyperlinks from http://www.wired.com would be something like //a[contains(@href,'wired')]/@href. You can find some sample code for this approach in this answer to a similar question.
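A rough, self-contained sketch of that JTidy + XPath approach (error handling kept minimal; the 'wired' filter is just the naïve expression above):

    import java.io.InputStream;
    import java.net.URL;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.w3c.tidy.Tidy;

    public class LinkExtractor {
        public static void main(String[] args) throws Exception {
            // Fetch the page and clean it up into a DOM with JTidy.
            try (InputStream in = new URL("http://www.wired.com").openStream()) {
                Tidy tidy = new Tidy();
                tidy.setXHTML(true);       // emit well-formed XHTML
                tidy.setQuiet(true);
                tidy.setShowWarnings(false);
                Document doc = tidy.parseDOM(in, null);

                // Pull out every href that mentions 'wired' with the naive XPath expression.
                XPath xpath = XPathFactory.newInstance().newXPath();
                NodeList hrefs = (NodeList) xpath.evaluate(
                        "//a[contains(@href,'wired')]/@href", doc, XPathConstants.NODESET);

                for (int i = 0; i < hrefs.getLength(); i++) {
                    System.out.println(hrefs.item(i).getNodeValue());
                }
            }
        }
    }

To turn this into a spider, you would feed the extracted URLs back into the same fetch-and-parse loop while keeping a set of already-visited pages.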
'Simple' is perhaps not a relevant concept here; crawling a whole site is a complex task. I recommend Nutch.