Parse HTML page in Java

Question

I'm parsing this page segment:

<tr valign="middle">
   <td class="inner"><span style=""><span class="" title=""></span> 2  <span class="icon ok" title="Verified"></span> </span><span class="icon cat_tv" title="Video » TV" style="bottom:-2;"></span> <a href="/VALUE.html" style="line-height:1.4em;">VALUE</a> </td>
   <td width="1%" align="center" nowrap="nowrap" class="small inner" >VALUE</td>
   <td width="1%" align="right" nowrap="nowrap" class="small inner" >VALUE</td>
   <td width="1%" align="center" nowrap="nowrap" class="small inner" >VALUE</td>
</tr>

I hold this segment in the variable tv: HtmlElement tv = tr.get(i);

I read the <a href="/VALUE.html" style="line-height:1.4em;">VALUE</a> tag this way:

HtmlElement a = tv.getElementsByTagName("a").get(0);        
object.name.value(a.getTextContent());

url = a.getAttribute("href");
object.url_detail.value(myBase + url);

How can I read only the VALUE field of the other <td>...</td> cells?

Maybe use `tv.getElementsByTagName("td")`, loop over the result, and get each cell's text with `getTextContent()`? Did you try that? — A4L, Mar 12 '13 at 13:02
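The loop suggested in the comment above can be sketched as follows. This is only an illustration under assumptions: it uses the JDK's built-in DOM parser rather than HtmlUnit, it assumes the row is well-formed XML (the segment in the question happens to be), and the class name `TdTextLoop`, the method `otherCellTexts`, and the sample cell values are hypothetical. HtmlUnit elements expose the same `getElementsByTagName` and `getTextContent` calls used in the question, so the loop body should carry over with little change.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TdTextLoop {

    // Returns the trimmed text of every <td> except the first one,
    // which holds the link that the question already handles.
    static List<String> otherCellTexts(String rowMarkup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rowMarkup)));
            NodeList cells = doc.getElementsByTagName("td");
            List<String> values = new ArrayList<>();
            for (int i = 1; i < cells.getLength(); i++) {
                values.add(cells.item(i).getTextContent().trim());
            }
            return values;
        } catch (Exception e) {
            // Assumption: callers only pass well-formed markup.
            throw new IllegalArgumentException("Row is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder values stand in for the real cell contents.
        String row = "<tr valign=\"middle\">"
                + "<td class=\"inner\"><a href=\"/VALUE.html\">VALUE</a></td>"
                + "<td class=\"small inner\">VALUE_A</td>"
                + "<td class=\"small inner\">VALUE_B</td>"
                + "<td class=\"small inner\">VALUE_C</td>"
                + "</tr>";
        System.out.println(otherCellTexts(row)); // prints [VALUE_A, VALUE_B, VALUE_C]
    }
}
```

Starting the loop at index 1 relies on the link cell always coming first; filtering on the `class` attribute would be more robust if the column order can change.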

score 5 · Accepted Answer · edited May 23 '17 at 12:29

I would suggest using XPath, a standard way to query XML/HTML documents.

Reference: How to read XML using XPath in Java

Also take a look at this question: RegEx match open tags except XHTML self-contained tags

Update

If I understood correctly, you need the "VALUE" from each td, right? If so, your XPath would be something like this:

//td[@class="small inner"]/text()
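To show how that expression might be evaluated, here is a minimal sketch using the JDK's `javax.xml.xpath` package. It assumes the row is well-formed XML, which real-world HTML pages often are not; the class name `TdXPath`, the method `smallInnerValues`, and the sample cell values are made up for illustration. HtmlUnit nodes also provide their own XPath lookup (`getByXPath`, if I recall the name correctly), which would avoid re-parsing the markup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TdXPath {

    // Evaluates the XPath from the answer and returns the matching text nodes, trimmed.
    static List<String> smallInnerValues(String rowMarkup) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(
                    "//td[@class=\"small inner\"]/text()",
                    new InputSource(new StringReader(rowMarkup)),
                    XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                values.add(hits.item(i).getNodeValue().trim());
            }
            return values;
        } catch (Exception e) {
            // Assumption: callers only pass well-formed markup.
            throw new IllegalArgumentException("Could not evaluate XPath on row", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder values stand in for the real cell contents.
        String row = "<tr valign=\"middle\">"
                + "<td class=\"inner\"><a href=\"/VALUE.html\">VALUE</a></td>"
                + "<td class=\"small inner\">VALUE_A</td>"
                + "<td class=\"small inner\">VALUE_B</td>"
                + "<td class=\"small inner\">VALUE_C</td>"
                + "</tr>";
        System.out.println(smallInnerValues(row)); // prints [VALUE_A, VALUE_B, VALUE_C]
    }
}
```

Note that `@class="small inner"` compares the whole attribute string, so a cell whose class list is written in a different order or with extra classes would not match.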

answered Mar 12 '13 at 13:05

Andrei Sfat


Shaowei Ling · Answer 2 · 2014-03-11T01:32:53.170

You could try jsoup, a wonderful Java package for parsing HTML.

UPDATE: using the package, you can solve the problem like this:

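    import java.util.Iterator;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.parser.Parser;
    import org.jsoup.select.Elements;

    String html = "<tr valign=\"middle\">"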
            + "   <td class=\"inner\">"
            + "   <span style=\"\"><span class=\"\" title=\"\"></span> 2  <span class=\"icon ok\" title=\"Verified\"></span> </span><span class=\"icon cat_tv\" title=\"Video » TV\" style=\"bottom:-2;\"></span>"
            + "   <a href=\"/VALUE.html\" style=\"line-height:1.4em;\">VALUE</a> "
            + "   </td>"
            + "   <td width=\"1%\" align=\"center\" nowrap=\"nowrap\" class=\"small inner\" >VALUE</td>"
            + "   <td width=\"1%\" align=\"right\" nowrap=\"nowrap\" class=\"small inner\" >VALUE</td>"
            + "   <td width=\"1%\" align=\"center\" nowrap=\"nowrap\" class=\"small inner\" >VALUE</td>"
            + "</tr>";
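    // XML parser mode keeps the bare <tr>/<td> tags; HTML mode expects them inside a <table>.
    Document doc = Jsoup.parse(html, "", Parser.xmlParser());
    // The link text is the first value.
    Elements labelPLine = doc.select("a[href]");
    System.out.println("value 1:" + labelPLine.text());

    // The remaining values are the cells with width="1%".
    Elements labelPLine2 = doc.select("td[width=1%]");
    Iterator<Element> it = labelPLine2.iterator();
    int n = 2;
    while (it.hasNext()) {
        System.out.println("value " + (n++) + ":" + it.next().text());
    }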

The result would be:

value 1:VALUE
value 2:VALUE
value 3:VALUE
value 4:VALUE

You should say how you could solve the problem using jsoup. Otherwise this is a non-answer and should just have been a comment. — Bull, Mar 10 '14 at 04:56
