6

I have to extract some information from a web page, and reformat it for the user.

Since the web page is fairly regular, I currently use HttpClient to retrieve the HTML as a string, and I extract the substrings that hold the relevant data from known locations.
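For concreteness, here is a minimal sketch of that substring approach. The class name, the `between` helper, the marker strings, and the sample HTML are all hypothetical, made up for illustration; the real page and markers will differ.

```java
// Hypothetical helper illustrating marker-based substring extraction.
final class SubstringExtractor {

    /**
     * Returns the trimmed text between the first occurrence of {@code start}
     * and the next occurrence of {@code end}, or null if either is missing.
     */
    static String between(String html, String start, String end) {
        int from = html.indexOf(start);
        if (from < 0) {
            return null; // start marker not found
        }
        from += start.length();
        int to = html.indexOf(end, from);
        if (to < 0) {
            return null; // end marker not found after the start marker
        }
        return html.substring(from, to).trim();
    }

    public static void main(String[] args) {
        // Stand-in for the string downloaded with HttpClient (assumed markup).
        String html = "<html><body><span id=\"price\"> 42.50 </span></body></html>";
        System.out.println(between(html, "<span id=\"price\">", "</span>")); // prints 42.50
    }
}
```

This works only as long as the markers stay byte-for-byte identical; a reordered attribute or an extra space in the tag makes `between` return null.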

Anyhow, I'm wondering if there is a better way, maybe an HTML-aware one. How would you do it?

Cheers

Mascarpone
  • possible duplicate of [Java HTML Parsing](http://stackoverflow.com/questions/238036/java-html-parsing) – jmj Jan 21 '11 at 17:03
  • http://stackoverflow.com/questions/4623427/html-parsing-using-java – jmj Jan 21 '11 at 17:03
  • http://stackoverflow.com/questions/4614211/java-html-parsing – jmj Jan 21 '11 at 17:04
  • This question might be very similar to others, but it has the slight difference of being Android-related, and Android has a different set of supported libraries than Java. – Mascarpone Jan 22 '11 at 01:17

4 Answers

7

Ideally, you should use a real HTML parser. I've used Jsoup successfully in the past on Android:

http://jsoup.org/
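As a rough sketch of what that could look like, the snippet below parses an HTML string that has already been downloaded and pulls out one element with a CSS selector. It assumes the jsoup jar is on the classpath; the class name, the `span#price` selector, and the sample markup are hypothetical and would need to match the real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class; only the org.jsoup calls come from the library.
final class JsoupExtractor {

    /** Returns the text of the first element matching span#price, or null if none matches. */
    static String priceOf(String html) {
        Document doc = Jsoup.parse(html);                  // builds a DOM even from untidy HTML
        Element price = doc.select("span#price").first();  // first() yields null when nothing matches
        return price == null ? null : price.text();        // text() returns whitespace-normalized text
    }

    public static void main(String[] args) {
        // Stand-in for the string already retrieved with HttpClient (assumed markup).
        String html = "<html><body><span id=\"price\"> 42.50 </span></body></html>";
        System.out.println(priceOf(html));
    }
}
```

Because the lookup is by element and id rather than by character position, it keeps working if the surrounding markup shifts, as long as the selected element itself stays in place.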

Computerish
3

I personally like to use Jericho parser: http://jericho.htmlparser.net/docs/index.html

It is easy to use, has many examples on the project's page, and copes well with poor HTML (unclosed tags, etc.).

FolksLord
1

We've used HttpUnit to do this in the past.

Speck
1

jsoup.org is better, but Cobra also has some additional features (it is CSS-aware and JavaScript-aware).

bltc