Parsing HTML from a web page

Question

I have to extract some information from a web page, and reformat it for the user.

Since the web page is somewhat regular, I currently use HttpClient to retrieve the HTML as a string and extract substrings at fixed locations that hold the relevant data.

Still, I'm wondering whether there is a better way, perhaps an HTML-aware one. How would you do it?

Cheers

possible duplicate of [Java HTML Parsing](http://stackoverflow.com/questions/238036/java-html-parsing) — jmj, Jan 21 '11 at 17:03
http://stackoverflow.com/questions/4623427/html-parsing-using-java — jmj, Jan 21 '11 at 17:03
http://stackoverflow.com/questions/4614211/java-html-parsing — jmj, Jan 21 '11 at 17:04
This question might be very similar to others, but it differs slightly in being Android-related, and Android supports a different set of libraries than standard Java. — Mascarpone, Jan 22 '11 at 01:17

score 7 · Accepted Answer · answered Jan 21 '11 at 17:00

Ideally, you should use a real HTML parser. I've used Jsoup successfully in the past on Android:

http://jsoup.org/
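As a rough illustration of what the HTML-aware approach might look like, here is a minimal sketch that parses a string (such as the one HttpClient already returns) and pulls fields out with CSS selectors instead of substring offsets. The class name `JsoupSketch`, the sample markup, and the selectors `div.item` and `span.price` are made up for the example; only `Jsoup.parse`, `select`, `text` and `attr` come from the Jsoup library, which has to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class JsoupSketch {

    // Hypothetical page fragment standing in for the HTML string
    // that was previously fetched over HTTP.
    public static final String SAMPLE_HTML =
            "<html><head><title>Price list</title></head><body>"
            + "<div class=\"item\"><a href=\"/a\">Apple</a> <span class=\"price\">1.20</span></div>"
            + "<div class=\"item\"><a href=\"/b\">Banana</a> <span class=\"price\">0.50</span></div>"
            + "</body></html>";

    // Returns one formatted line per <div class="item"> found in the HTML.
    public static List<String> extractItems(String html) {
        // Build a DOM from the string; Jsoup tolerates sloppy markup.
        Document doc = Jsoup.parse(html);
        List<String> out = new ArrayList<String>();
        // select() takes a CSS query, so the lookup follows the page
        // structure rather than character positions.
        for (Element item : doc.select("div.item")) {
            String name = item.select("a").text();
            String href = item.select("a").attr("href");
            String price = item.select("span.price").text();
            out.add(name + " (" + href + "): " + price);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : extractItems(SAMPLE_HTML)) {
            System.out.println(line);
        }
    }
}
```

If the site adds whitespace or reorders attributes, the selectors should keep working where fixed substring positions would not. Jsoup can also fetch the page itself via `Jsoup.connect(url).get()`, though keeping the existing HttpClient download and passing the string to `Jsoup.parse` works just as well.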

Computerish


score 3 · Answer 2 · answered Jan 21 '11 at 18:32

I personally like to use the Jericho parser: http://jericho.htmlparser.net/docs/index.html

It is easy to use, has many examples on the project's page, and copes well with real-world HTML (unclosed tags, etc.).

FolksLord


score 1 · Answer 3 · answered Jan 21 '11 at 17:24

We've used HttpUnit to do this in the past.

Speck


score 1 · Answer 4 · answered Jan 21 '11 at 18:09

jsoup.org is better, but Cobra also has some additional features (it is CSS-aware and JavaScript-aware).

bltc

