Java : HTML Parsing

Question

I have the HTML content given below. The items I am looking for here are "img src" and "!important". Does Java provide any HTML parsing techniques?

<fieldset>
<table cellpadding='0'border='0'cellspacing='0'style="clear :both">
<tr valign='top' ><td width='35' >
<a href='http://mypage.rediff.com/android/32868898'class='space' onmousedown="return
 enc(this,'http://track.rediff.com/clickurl=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F3 868898___&service=mypage_feeds&clientip=202.137.232.117&pos=0&feed_id=12942949154d255f839677925642&prc_id=32868898&rowid=2064549114')" >
<div style='width:25px;height:25px;overflow:hidden;'>
<img src='http://socialimg04.rediff.com/image.php?uid=32868898&type=thumb'  width='25'  vspace='0'  /></div></a></td> <td><span>
<a href='http://mypage.rediff.com/android/32868898'  class="space" onmousedown="return enc(this,'http://track.rediff.com/click?url=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F32868898___&service=mypage_feeds&clientip=202.137.232.117&pos=0&feed_id=12942949154d255f839677925642&prc_id=32868898&rowid=2064549114')" >Android </a> </span><span style='color:#000000
!important;'>android se updates...</span><div class='divtext'></div></td></tr><tr><td height='5' ></td></tr></table></fieldset><br/>

@marcog People often mix those two up, so I'm just double-checking; no harm in that. – ant, Jan 06 '11 at 12:42

score 2 · Accepted Answer · edited May 23 '17 at 09:58


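// Parse the file once (needs org.jsoup.Jsoup, org.jsoup.nodes.Document, java.io.File; parse throws IOException).
Document doc = Jsoup.parse(new File("d:\\1.html"), "UTF-8");
// attr("src") returns the src value of the first matched img element.
System.out.println(doc.select("img").attr("src")); // http://socialimg04.rediff.com/image.php?uid=32868898&type=thumb
// Select the span whose style attribute ends with "important;" and print its text.
System.out.println(doc.select("span[style$=important;]").first().text()); // android se updates...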

answered Jan 06 '11 at 11:09 by jmj · edited May 23 '17 at 09:58 by Community

And for the `important` part, please clarify the question. – jmj Jan 06 '11 at 11:19
There is an `important` tag that contains text, which I need to fetch. – Faheem Kalsekar Jan 06 '11 at 11:58
I can't see any `important` tag in the HTML you have given. – jmj Jan 06 '11 at 11:58

score 1 · Answer 2 · answered Jan 06 '11 at 11:08

Try NekoHTML. This is the HTML parsing library used by various higher-level testing frameworks such as HtmlUnit.

NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.

score 1 · Answer 3 · answered Jan 06 '11 at 11:20


I used jsoup. This library has a nice selector syntax (http://jsoup.org/cookbook/extracting-data/selector-syntax), and for your problem you can use code like this:

File input = new File("input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Elements pngs = doc.select("img[src$=.png]");

answered by Igor

For 'important', use doc.getElementsByAttributeValueMatching(String key, String regex). In your case the key is "style" (span style="... !important") and the regex is "(!important)". – Igor Jan 06 '11 at 12:52
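
Putting the two suggestions above together, a minimal sketch might look like the following. It assumes jsoup is on the classpath. The class name ImportantSpanSketch, the helper method names, and the trimmed-down HTML string are made up for illustration and are not from the original posts.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class ImportantSpanSketch {
    // Hypothetical, shortened stand-in for the markup in the question.
    static final String HTML =
        "<table><tr><td>"
        + "<img src='http://socialimg04.rediff.com/image.php?uid=32868898&type=thumb' width='25' />"
        + "<span style='color:#000000 !important;'>android se updates...</span>"
        + "</td></tr></table>";

    // Returns the src attribute of the first img element ("" if there is none).
    static String firstImageSrc(Document doc) {
        return doc.select("img").attr("src");
    }

    // Returns the text of the first element whose style attribute
    // contains "!important" (the regex is applied as a substring search).
    static String importantText(Document doc) {
        Elements matches = doc.getElementsByAttributeValueMatching("style", "!important");
        return matches.isEmpty() ? "" : matches.first().text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(firstImageSrc(doc));
        System.out.println(importantText(doc));
    }
}
```

Checking isEmpty() before calling first() avoids a NullPointerException when no element carries an "!important" style, which is worth guarding against if the page layout can change.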

score 1 · Answer 4 · answered Jan 21 '11 at 18:40

I like using Jericho: http://jericho.htmlparser.net/docs/index.html

It is robust against badly formed HTML, links leading to unavailable locations, etc.

There are a lot of examples on their page; you just get all IMG tags and analyze their attributes to extract those that meet your needs.
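
As a rough sketch of the "get all IMG tags and analyze their attributes" approach, something like the following should work with the Jericho library on the classpath. The class name JerichoImgSketch, the method imageSources, and the sample HTML in main are hypothetical and only serve as an illustration.

```java
import java.util.ArrayList;
import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

public class JerichoImgSketch {
    // Collects the src attribute of every img element found in the given HTML.
    static List<String> imageSources(String html) {
        Source source = new Source(html);
        List<String> result = new ArrayList<>();
        for (Element img : source.getAllElements(HTMLElementName.IMG)) {
            String src = img.getAttributeValue("src");
            if (src != null) { // skip img tags that have no src attribute
                result.add(src);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical input; the div is deliberately left unclosed.
        String html = "<div><img src='a.png'><img alt='no source'><img src='b.jpg'>";
        System.out.println(imageSources(html));
    }
}
```

From here you can filter the returned list by whatever rule fits your case, for example keeping only URLs from a particular host.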
