What are some good Java libraries to search and scrape data out of a web page?

Question

What are some good open source Java libraries to search and scrape data out of a web page and stick it into a database? For example, suppose I had a page such as:

<tr><td><b>Address:</b></td>
<td colspan=3>123 My Street        </td></tr>

"Address:" is the key, but I'm actually trying to get "123 My Street", which is separated from it by a bunch of HTML tags and spaces. Ideally I want to get the value inside the td that follows the string "Address:". It seems like JSoup can do the find, but I didn't see a good example of how to do the offset (I may have missed it). Is there a library that handles key/value pairs?

I'd also be interested in learning about any open source (MIT/Apache) initiatives for UI scripting similar to the Kapow Extraction Browser.

Thanks.

Paker · Answer 1 · 2011-12-16T16:53:32.903

2

Try Web-Harvest. It's an open-source crawler written in Java.
It can be used as a Java library, as a command-line application, or with its standalone IDE.

You can use the <xpath> element to extract any value from the XHTML document.
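To illustrate the XPath idea, here is a minimal sketch that uses the JDK's built-in javax.xml.xpath API rather than Web-Harvest itself. It assumes the page has already been cleaned into well-formed XHTML (note the quoted colspan attribute), and the class and method names are made up for this example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Web-Harvest or any library.
class AddressXPathSketch {

    // Returns the whitespace-normalized text of the first <td> that follows
    // the <td> whose <b> child has exactly the given label text.
    static String valueAfterLabel(String xhtml, String label)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind $label as an XPath variable so quotes in the label cannot break the expression.
        xpath.setXPathVariableResolver(
                name -> "label".equals(name.getLocalPart()) ? label : null);
        String expr = "normalize-space(//td[b = $label]/following-sibling::td[1])";
        return xpath.evaluate(expr, new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        // Assumed input: the question's snippet after cleanup to well-formed XHTML.
        String xhtml = "<table><tr><td><b>Address:</b></td>\n"
                + "<td colspan=\"3\">123 My Street        </td></tr></table>";
        System.out.println(valueAfterLabel(xhtml, "Address:"));
    }
}
```

The following-sibling::td[1] step is what performs the "offset" from the label cell to the value cell, and normalize-space() drops the trailing spaces.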

edited Dec 16 '11 at 16:53

answered Dec 16 '11 at 16:34


score 1 · Answer 2 · answered Jul 29 '11 at 02:28

This is a good list of open source parsers: http://java-source.net/open-source/html-parsers

I've used TagSoup with great success for parsing tens of thousands of web pages in the wild. As for the "key-value" relationship, that's something you'll have to deal with yourself.
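One way to handle the key/value relationship yourself is sketched below. It does not use TagSoup's API; it assumes the markup has already been turned into well-formed XHTML by some parser, and that every row of interest has a bold label in its first cell and the value in its second. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
class LabelValueSketch {

    // Collects label -> value pairs from rows shaped like
    // <tr><td><b>Label:</b></td><td>value</td></tr>.
    static Map<String, String> extractPairs(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Select only rows whose first cell contains a <b> label.
        NodeList rows = (NodeList) xpath.evaluate(
                "//tr[td[1]/b]", doc, XPathConstants.NODESET);

        Map<String, String> pairs = new LinkedHashMap<>();
        for (int i = 0; i < rows.getLength(); i++) {
            Node row = rows.item(i);
            String key = xpath.evaluate("normalize-space(td[1]/b)", row);
            String value = xpath.evaluate("normalize-space(td[2])", row);
            // Drop the trailing colon so "Address:" becomes "Address".
            pairs.put(key.replaceAll(":$", ""), value);
        }
        return pairs;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<table><tr><td><b>Address:</b></td>\n"
                + "<td colspan=\"3\">123 My Street        </td></tr></table>";
        System.out.println(extractPairs(xhtml));
    }
}
```

The resulting map can then be written to a database with whatever persistence layer you already use.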
