
What are some good open source Java libraries for searching and scraping data from a web page and storing it in a database? For example, suppose I had a page such as:

<tr><td><b>Address:</b></td>
<td colspan=3>123 My Street        </td></tr>

"Address:" is the key, but what I actually want is the value "123 My Street", which is separated from the key by several HTML tags and some whitespace. Ideally I want the contents of the td that follows the string "Address:". It seems JSoup can do the find, but I didn't see a good example of how to do the offset (I may have missed it). Is there a library that handles key/value pairs?

I'd also be interested in learning about any open source (MIT/Apache) initiatives for UI scripting similar to the Kapow Extraction Browser.

Thanks.

JStark

2 Answers


Try Web-Harvest. It's an open-source crawler written in Java.
It can be used as a Java library, as a command-line application, or with its standalone IDE.

You can use the <xpath> element to extract any value from the XHTML document.
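To give a feel for the kind of XPath expression that would pull out the value after "Address:", here is a minimal sketch. It does not use Web-Harvest's own configuration syntax; it uses the JDK's built-in javax.xml.xpath API instead, and it assumes the page has already been cleaned up into well-formed XHTML (note the quoted colspan attribute). The class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Web-Harvest or any other library.
class AddressXPathSketch {

    // Returns the whitespace-normalized text of the first <td> that follows
    // the <td> whose text equals the label, or "" if there is no such cell.
    // Assumption: the label contains no single quote, since it is pasted
    // directly into the XPath string literal.
    static String valueAfterLabel(String xhtml, String label) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expr = "normalize-space(//td[normalize-space(.)='" + label
                + "']/following-sibling::td[1])";
        try {
            return xpath.evaluate(expr, new InputSource(new StringReader(xhtml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath on the input", e);
        }
    }

    public static void main(String[] args) {
        // The fragment from the question, wrapped in <table> and made well-formed.
        String xhtml = "<table><tr><td><b>Address:</b></td>\n"
                + "<td colspan=\"3\">123 My Street        </td></tr></table>";
        System.out.println(valueAfterLabel(xhtml, "Address:")); // prints: 123 My Street
    }
}
```

The following-sibling::td[1] step is what does the "offset" from the key cell to the value cell, and normalize-space trims the trailing spaces. The same expression should be usable wherever an XPath 1.0 engine is available.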

Paker

This is a good list of open source parsers: http://java-source.net/open-source/html-parsers

I've used TagSoup with great success for parsing tens of thousands of web pages in the wild. As for the "key-value" relationship, that's something you'll have to deal with yourself.
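As a rough idea of what handling the key/value relationship yourself might look like, here is a minimal sketch. It assumes the HTML has already been normalized to well-formed XHTML (for example by a parser such as TagSoup) and that each row holds a label cell ending in ":" followed by a value cell. It uses only the JDK's DOM API; the class name, method names, and the second sample row are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class for illustration only.
class RowKeyValueSketch {

    // Collects label/value pairs from rows shaped like
    // <tr><td>Label:</td><td>value</td></tr>, in document order.
    // Nested tables are not handled specially in this sketch.
    static Map<String, String> pairs(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
        Map<String, String> result = new LinkedHashMap<>();
        NodeList rows = doc.getElementsByTagName("tr");
        for (int i = 0; i < rows.getLength(); i++) {
            NodeList cells = ((Element) rows.item(i)).getElementsByTagName("td");
            if (cells.getLength() < 2) {
                continue; // not a label/value row
            }
            String key = clean(cells.item(0).getTextContent());
            if (key.endsWith(":")) {
                // Drop the trailing colon from the label before storing it.
                result.put(key.substring(0, key.length() - 1),
                        clean(cells.item(1).getTextContent()));
            }
        }
        return result;
    }

    // Trims the text and collapses internal runs of whitespace to one space.
    static String clean(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String xhtml = "<table>"
                + "<tr><td><b>Address:</b></td><td colspan=\"3\">123 My Street        </td></tr>"
                + "<tr><td>City:</td><td> Springfield </td></tr>" // made-up second row
                + "</table>";
        System.out.println(pairs(xhtml)); // prints: {Address=123 My Street, City=Springfield}
    }
}
```

From there, writing the map entries to a database is ordinary JDBC work; the parser only has to get you to clean, well-formed markup.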

Ryan Stewart