Simplest way to extract information (parsing) from HTML in Java

Question

I've read a lot of questions on Stack Overflow about HTML parsing. I've learned that, when possible, we should avoid regex and use a parser instead. I know there are a lot of HTML/XML parsers, but I don't know how to use them properly.

Consider this HTML, cleaned up with jTidy. I have a Document object that jTidy created from this markup:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
    <!-- Header content -->
</head>
<body>
    <div id="container">
        <div id="id1"> ... </div>
        <div id="id2"> ... </div>
        <div id="mainContent">
            <div id="section 1">
                <div id="subSection">
                    <!-- Interested part -->
                    <tbody>
                        <tr class="success">
                            <td class="fileName"><span>File One</span></td>
                        </tr>
                        <tr class="fail">
                            <td class="fileName"><span>File Two</span></td>
                        </tr>                        
                        <tr class="success">
                            <td class="fileName"><span>File Three</span></td>
                        </tr>
                    </tbody>
                </div>
            </div>
        </div>
    </div>
</body>

Now I would like to map (in a Map :D) each file name to its class (success/fail). I can do it with DOM, but I would have to create a NodeList and, for each Element, create another NodeList (memory-hungry and tedious). There are alternatives such as SAX, Xerces, etc., but I don't know their advantages and disadvantages.

What is the simplest (and fastest) way to extract this information from the "jTidied" HTML above?

Use XPath: http://stackoverflow.com/questions/7049150/how-to-extract-data-using-jtidy-and-xpath — Greg, Feb 26 '12 at 18:13
I've read about XPath, but the problem is that I would have to: 1) create a pattern for file names, 2) create a pattern for classes, 3) associate each class with its file name. It's not very simple. — Angelo, Feb 26 '12 at 18:17

score 1 · Accepted Answer · answered Feb 27 '12 at 10:36

First of all, you forgot to add the <table> tag.

You can parse your code very easily with Jsoup.

Here is an example:

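import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseFiles {
    public static void main(String[] args) throws IOException {
        // From a string instead:
        // String html = " ...here goes your html code... ";
        // Document doc = Jsoup.parse(html);
        // Or from a file:
        File input = new File("com.htm");
        Document doc = Jsoup.parse(input, "UTF-8");
        Elements trs = doc.select("tr"); // select all "tr" elements from the document
        for (Element tr : trs) {
            // The class attribute of the tr element, then the text inside its td element
            System.out.println("The file class is: " + tr.attr("class")
                    + " The filename is: " + tr.select("td").text());
        }
    }
}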
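
Since the question asks for a Map rather than printed output, the same selectors can fill one directly. This is only a sketch: the class name FileStatusJsoup, the SAMPLE constant, and the extract method are made up for illustration, and the sample markup is a cut-down copy of the question's rows wrapped in the missing <table> tag. It assumes file names are unique, because they are used as map keys.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FileStatusJsoup {

    // Hypothetical sample: the rows from the question, wrapped in <table>.
    static final String SAMPLE =
          "<table><tbody>"
        + "<tr class=\"success\"><td class=\"fileName\"><span>File One</span></td></tr>"
        + "<tr class=\"fail\"><td class=\"fileName\"><span>File Two</span></td></tr>"
        + "<tr class=\"success\"><td class=\"fileName\"><span>File Three</span></td></tr>"
        + "</tbody></table>";

    // Maps each file name to the class attribute of its row.
    static Map<String, String> extract(String html) {
        Document doc = Jsoup.parse(html);
        // LinkedHashMap keeps the rows in document order.
        Map<String, String> result = new LinkedHashMap<>();
        // Only rows that actually contain a td with class "fileName".
        for (Element row : doc.select("tr:has(td.fileName)")) {
            String fileName = row.select("td.fileName").text();
            result.put(fileName, row.className());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {File One=success, File Two=fail, File Three=success}
        System.out.println(extract(SAMPLE));
    }
}
```

If the same file name can appear more than once, a Map<String, List<String>> or a list of pairs would be a safer target than a plain Map.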

Thank you. The <table> tag was ignored because of too much indentation. Thanks again! — Angelo, Feb 27 '12 at 11:19

score 0 · Answer 2 · edited May 23 '17 at 11:48

In my opinion, the best approach would be to use XSLT and XPath (as Greg suggested in a comment) to produce input for an unmarshaller.

So the entire flow looks like this: HTML -> [jTidy cleanup] -> XHTML -> [XSLT transformation] -> string data representation -> [JAXB unmarshaller] -> Java object(s).

If you don't need objects, use only XPath, as described in this thread: How to read XML using XPath in Java
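
For the XPath-only route, a minimal sketch using just the JDK's javax.xml.xpath API might look like the following. It answers the concern about associating class and file name: select the rows once, then evaluate a relative expression against each row. The names FileStatusXPath, SAMPLE, parse, and extract are hypothetical, and the sample is a well-formed, namespace-free stand-in for the jTidy output. With a real jTidy Document you would presumably pass that object to extract instead of calling parse, and a namespace-aware DOM would need prefixed XPath names.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class FileStatusXPath {

    // Hypothetical stand-in for the cleaned-up document (no XHTML namespace).
    static final String SAMPLE =
          "<html><body><div id=\"subSection\"><table><tbody>"
        + "<tr class=\"success\"><td class=\"fileName\"><span>File One</span></td></tr>"
        + "<tr class=\"fail\"><td class=\"fileName\"><span>File Two</span></td></tr>"
        + "<tr class=\"success\"><td class=\"fileName\"><span>File Three</span></td></tr>"
        + "</tbody></table></div></body></html>";

    // Parses a well-formed XML string into a W3C DOM Document.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Maps each file name to the class attribute of its row.
    static Map<String, String> extract(Document doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // One query for all rows that carry a class attribute.
            NodeList rows = (NodeList) xpath.evaluate(
                    "//tr[@class]", doc, XPathConstants.NODESET);
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < rows.getLength(); i++) {
                Element row = (Element) rows.item(i);
                // Relative expression, evaluated against the current row only.
                String fileName = xpath.evaluate(
                        "normalize-space(td[@class='fileName'])", row);
                result.put(fileName, row.getAttribute("class"));
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Invalid XPath expression", e);
        }
    }

    public static void main(String[] args) {
        // prints {File One=success, File Two=fail, File Three=success}
        System.out.println(extract(parse(SAMPLE)));
    }
}
```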
