
What is the preferred way to extract elements from an HTML page in Java?

My HTML has many rows like the following:

<tr class="item-odd">
       <td class="data"><a href="http://.....">TITLE</a></td>
       <td><div class="cost">$1.99</div></td>
</tr>

The class alternates item-odd and item-even.

I need to extract:

  1. URL
  2. Title
  3. Price

Are regular expressions the way to go?

mrblah
  • No, not regex. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Mark Byers Jan 06 '10 at 22:54
  • According to your user activity page, you've asked no fewer than 24 questions in the last 24 hours. Have you ever thought of maybe answering a question once in a while just for kicks? – Michael Myers Jan 06 '10 at 22:59
  • mmyers, I'm learning Java and I love this site; it has helped me a lot. I am voting and marking questions as answered, so I am doing my part in a way. – mrblah Jan 06 '10 at 23:10
  • Not arguing that point. But if you haven't learned enough to answer any questions yet, is it really working? :) – Michael Myers Jan 06 '10 at 23:14
  • mmyers, it's been only 2 days with Java! – mrblah Jan 06 '10 at 23:17
  • mrblah, I've said it before and I'll say it again -- your method of learning Java is increasingly disrespectful of this community. I'll comment over on your latest HtmlUnit question, but I'm not even sure you've learned to read the Javadocs for an API and find the methods you need on your own -- it appears your first instinct has quickly become to ask here rather than seek the information on your own and learn by reading the documentation. – delfuego Jan 07 '10 at 16:49
  • Setting aside what Delfuego is saying for a moment, which may be fully valid (I haven't bothered reviewing blah's history), I don't think you can chide someone for only asking questions and not answering them. Not everyone is well suited to answering, and the site doesn't cease to be valuable or productive even if only a subset of users actually answer questions. Consider Wikipedia. – Jherico Jan 07 '10 at 20:54

2 Answers


I'd use a library like HTML Parser for this job. Have a look at the samples and/or the javadoc. Also have a look at previous questions here on SO.

HTML Parser is pretty easy to use and should do the job. For alternatives, have a look at this previous answer.

Pascal Thivent
  • Is it different from HtmlUnit? It looks similar. – mrblah Jan 06 '10 at 23:00
  • HtmlUnit is a testing tool. HTML Parser is... a parser. So yes, they are different. – Pascal Thivent Jan 06 '10 at 23:02
  • True, but HtmlUnit does have parser-type methods. I get your point, though! – mrblah Jan 06 '10 at 23:09
  • Well, HtmlUnit does indeed need to parse HTML to make assertions on it, but the suggested tools let you do advanced manipulations, clean up crappy HTML, etc. Just have a look at the API, you'll see. They really have different purposes. – Pascal Thivent Jan 06 '10 at 23:13
  • Say you have an HTML page, how could you get a collection of the above (see question) HTML? I have maybe 10-20 sets in my HTML; how would I get that with htmlparser? – mrblah Jan 06 '10 at 23:18
  • You could use a filter, or a visitor (as documented on its website). Have a look at the javadoc of NodeVisitor for example (http://htmlparser.sourceforge.net/javadoc/org/htmlparser/visitors/NodeVisitor.html) and try it. Also, have a look at the samples (http://htmlparser.sourceforge.net/samples.html). – Pascal Thivent Jan 06 '10 at 23:38

JTidy does an excellent job of parsing HTML and making it available for manipulation as a DOM. Regular expressions are generally not the way to go, since HTML isn't a regular language and has numerous edge cases to trip you up.
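
To make the DOM approach concrete, here is a minimal sketch of pulling the URL, title, and price out of each row once the page is available as a DOM. It assumes the markup has already been cleaned into well-formed XHTML (for example by JTidy) and uses only the JDK's built-in XML parser and XPath, not JTidy's own API. The class name RowExtractor, the Item holder, and the example.com URLs are illustrative assumptions, not part of the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: extracts (url, title, price) from item-odd/item-even rows.
public class RowExtractor {

    /** One extracted row: link target, link text, and price text. */
    public static final class Item {
        public final String url;
        public final String title;
        public final String price;

        Item(String url, String title, String price) {
            this.url = url;
            this.title = title;
            this.price = price;
        }
    }

    /** Parses well-formed XHTML and returns one Item per matching table row. */
    public static List<Item> extract(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Select only the alternating data rows described in the question.
            NodeList rows = (NodeList) xpath.evaluate(
                    "//tr[@class='item-odd' or @class='item-even']",
                    doc, XPathConstants.NODESET);

            List<Item> items = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                Element row = (Element) rows.item(i);
                // Paths are evaluated relative to the current row.
                String url = xpath.evaluate("td[@class='data']/a/@href", row);
                String title = xpath.evaluate("td[@class='data']/a", row);
                String price = xpath.evaluate("td/div[@class='cost']", row);
                items.add(new Item(url, title.trim(), price.trim()));
            }
            return items;
        } catch (Exception e) {
            // The JDK parser rejects markup that is not well-formed.
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        // Illustrative input; real pages would first go through an HTML cleaner.
        String xhtml =
                "<table>"
              + "<tr class=\"item-odd\">"
              + "<td class=\"data\"><a href=\"http://example.com/a\">First</a></td>"
              + "<td><div class=\"cost\">$1.99</div></td>"
              + "</tr>"
              + "<tr class=\"item-even\">"
              + "<td class=\"data\"><a href=\"http://example.com/b\">Second</a></td>"
              + "<td><div class=\"cost\">$2.49</div></td>"
              + "</tr>"
              + "</table>";

        for (Item item : extract(xhtml)) {
            System.out.println(item.url + " | " + item.title + " | " + item.price);
        }
    }
}
```

The XPath expressions match on the exact class values shown in the question; if the real page puts several classes on one element, the predicates would need contains() or similar instead of an equality test.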

Brian Agnew