
I need to supply a base URL (such as http://www.wired.com) and spider through the entire site, outputting an array of pages (off the base URL). Is there any library that would do the trick?

Thanks.

rs79

2 Answers


I have used Web Harvest a couple of times, and it is quite good for web scraping.

Web-Harvest is an open-source Web data extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. To do that, it leverages well-established techniques and technologies for text/XML manipulation such as XSLT, XQuery and regular expressions. Web-Harvest mainly focuses on HTML/XML-based web sites, which still make up the vast majority of Web content. On the other hand, it can easily be supplemented by custom Java libraries to augment its extraction capabilities.
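
For reference, a minimal sketch of driving Web-Harvest from Java code, assuming its embedding API (ScraperConfiguration and Scraper); the configuration file name wired-crawl.xml and the working directory are placeholders, and the actual fetching and extraction rules live in that XML file:

    import org.webharvest.definition.ScraperConfiguration;
    import org.webharvest.runtime.Scraper;

    public class WebHarvestExample {
        public static void main(String[] args) throws Exception {
            // Load the XML configuration that describes which pages to fetch
            // and what to extract ("wired-crawl.xml" is a placeholder name).
            ScraperConfiguration config = new ScraperConfiguration("wired-crawl.xml");

            // The second argument is a working directory where Web-Harvest
            // keeps downloaded content.
            Scraper scraper = new Scraper(config, "work");

            // Run the configuration; extracted variables end up in the scraper context.
            scraper.execute();
        }
    }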

Alternatively, you can roll your own web scraper using tools such as JTidy to first convert an HTML document to XHTML, and then process the information you need with XPath. For example, a very naïve XPath expression to extract all hyperlinks from http://www.wired.com would be something like //a[contains(@href,'wired')]/@href. You can find some sample code for this approach in this answer to a similar question.
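
For illustration, here is a minimal, self-contained sketch of that JTidy-plus-XPath approach (the class name LinkExtractor is just illustrative, and a real spider would also need to queue and deduplicate the URLs it finds):

    import java.io.InputStream;
    import java.net.URL;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.w3c.tidy.Tidy;

    public class LinkExtractor {
        public static void main(String[] args) throws Exception {
            InputStream in = new URL("http://www.wired.com").openStream();

            // Clean the raw HTML into a well-formed DOM with JTidy
            Tidy tidy = new Tidy();
            tidy.setQuiet(true);
            tidy.setShowWarnings(false);
            Document doc = tidy.parseDOM(in, null);

            // Select every href that points back to the same site
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList links = (NodeList) xpath.evaluate(
                    "//a[contains(@href,'wired')]/@href", doc, XPathConstants.NODESET);

            for (int i = 0; i < links.getLength(); i++) {
                System.out.println(links.item(i).getNodeValue());
            }
        }
    }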

João Silva
  • Thanks for this resource. I was able to adapt it successfully. However, if a web page responds with a 500, the scraper fails (for instance, http://www.allure.com/magazine/flipbook), outputting "An invalid XML character (Unicode: 0x0) was found in the element content of the document." Any thoughts on this error message? – rs79 Feb 22 '11 at 20:52
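
One possible workaround (a sketch, not tested against that page): check the HTTP status code before handing the stream to JTidy, since error pages sometimes contain bytes such as 0x0 that are not legal XML characters. The class and method names below are only illustrative.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class SafeFetch {
        // Returns the page body only when the server answers with a
        // non-error status; returns null for 4xx/5xx responses so the
        // caller can skip them instead of feeding them to the parser.
        static InputStream openIfOk(String pageUrl) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
            if (conn.getResponseCode() >= 400) {
                conn.disconnect();
                return null;
            }
            return conn.getInputStream();
        }
    }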

'Simple' is perhaps not a relevant concept here; it's a complex task. I recommend Nutch.

bmargulies