Java convert a DOM object into paragraphs

Question

Is there a Java equivalent of the Python function lxml.sax.saxify [1], which generates SAX events from a DOM and fires them against a SAX ContentHandler? My main goal is to convert a DOM object into a list of paragraphs. Given this HTML snippet:

<p> Here is a text! 
<ul><li>list1</li><li>list2</li></ul>
</p>

The output that I want is:

1st paragraph: Here is a text!
2nd paragraph: list1
3rd paragraph: list2

[1] http://lxml.de/api/lxml.sax-module.html#saxify

So you want to retrieve all text nodes from a DOM document? – Sharon Ben Asher Jul 06 '15 at 14:30

score 0 · Answer 1 · answered Jul 06 '15 at 14:43

Yes. You can run an identity transformation from a DOMSource to a SAXResult; see http://www.java2s.com/Code/Java/XML/GeneratingSAXParsingEventsbyTraversingaDOMDocument.htm:

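// Wrap the existing DOM document as the transformation source.
Source source = new DOMSource(doc);

// Optional: give the source a system ID so relative references can be resolved.
URI uri = new File("infilename.xml").toURI();
source.setSystemId(uri.toString());

// MyHandler is your own handler class (a DefaultHandler subclass).
DefaultHandler handler = new MyHandler();
SAXResult result = new SAXResult(handler);
// A Transformer created without a stylesheet copies the source to the result unchanged,
// which fires the SAX events for the DOM against the handler.
Transformer xformer = TransformerFactory.newInstance().newTransformer();
xformer.transform(source, result);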
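
To make the handler side concrete, here is a minimal, self-contained sketch of the same DOMSource/SAXResult approach. The class and method names (DomToSaxParagraphs, ParagraphHandler, paragraphs, parse) are made up for illustration, and the rule "every element boundary ends a paragraph" is only an assumption about what the question wants. It also assumes the input is well-formed XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXResult;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class DomToSaxParagraphs {

    /** Hypothetical handler: buffers text and closes a paragraph at every element boundary. */
    static class ParagraphHandler extends DefaultHandler {
        final List<String> paragraphs = new ArrayList<>();
        private final StringBuilder buffer = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so accumulate it.
            buffer.append(ch, start, length);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            flush();
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            flush();
        }

        private void flush() {
            String text = buffer.toString().trim();
            if (!text.isEmpty()) {
                paragraphs.add(text); // whitespace-only runs are dropped
            }
            buffer.setLength(0);
        }
    }

    /** Fires SAX events for the DOM against the handler and returns the collected text. */
    static List<String> paragraphs(Document doc) throws Exception {
        ParagraphHandler handler = new ParagraphHandler();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new SAXResult(handler));
        return handler.paragraphs;
    }

    /** Helper for the demo: builds a DOM from a well-formed XML string. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<p> Here is a text! \n<ul><li>list1</li><li>list2</li></ul>\n</p>";
        paragraphs(parse(xml)).forEach(System.out::println);
    }
}
```

For the snippet in the question, this should collect "Here is a text!", "list1" and "list2", in that order.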

But why don't you extract the information you want from your DOM itself?

JP Moresmau

And how is that? I am new to DOM parsing. – Daisy Jul 06 '15 at 21:04
Also, the files that I would like to parse are HTML, not XML. – Daisy Jul 07 '15 at 07:48

score 0 · Answer 2 · answered Jul 07 '15 at 09:47

If you want to retrieve all text nodes from a DOM document (which is a different question than the original one), then XPath is the easiest (and most efficient) way to search and extract data from a DOM document.

Here is the piece of code you need:

// Parse the file into a DOM document (the input must be well-formed XML).
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse("/path/example.html");
XPath xPath = XPathFactory.newInstance().newXPath();
String pattern = "//*/text()"; // retrieve all text nodes in the doc
NodeList nl = (NodeList) xPath.compile(pattern)
        .evaluate(doc, XPathConstants.NODESET);
for (int i = 0; i < nl.getLength(); i++) {
    Node n = nl.item(i);
    String text = n.getNodeValue().trim();
    // skip over whitespace-only text
    if (!text.isEmpty()) {
        System.out.println(text);
    }
}
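
As a variation on the code above, the whitespace filter can be moved into the XPath expression itself with a normalize-space() predicate. The sketch below applies it to the snippet from the question, parsed from a string rather than a file. The class and method names (XPathTextNodes, textNodes) are hypothetical, and it assumes the markup is well-formed XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathTextNodes {

    /** Returns the trimmed content of every non-blank text node, in document order. */
    static List<String> textNodes(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        // [normalize-space()] is false for whitespace-only text nodes, so they are skipped.
        NodeList nl = (NodeList) xPath.evaluate(
                "//text()[normalize-space()]", doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nl.getLength(); i++) {
            result.add(nl.item(i).getNodeValue().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<p> Here is a text! \n<ul><li>list1</li><li>list2</li></ul>\n</p>";
        List<String> paragraphs = textNodes(xml);
        for (int i = 0; i < paragraphs.size(); i++) {
            System.out.println("paragraph " + (i + 1) + ": " + paragraphs.get(i));
        }
    }
}
```

Returning a List instead of printing makes it easier to post-process the paragraphs afterwards.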

Thanks for your reply. I tried your code, but I encountered this error: `[Fatal Error] loose.dtd:31:3: The declaration for the entity "HTML.Version" must end with '>'`. Do you have any idea whether this is because I am parsing HTML rather than XML? — Daisy, Jul 07 '15 at 12:11
Quick googling suggests that this is indeed an issue with the SGML DTD. There is a thread here that suggests ways to overcome it: http://stackoverflow.com/questions/155101/make-documentbuilder-parse-ignore-dtd-references — Sharon Ben Asher, Jul 07 '15 at 13:00
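
One commonly used way to make DocumentBuilder skip a DTD reference is to install an EntityResolver that answers every external entity request with empty content. The sketch below shows that idea; the class and method names (IgnoreDtd, parseIgnoringDtd) are hypothetical. Note the assumptions: this only stops the parser from loading loose.dtd, the markup itself must still be well-formed XML, and named entities that the DTD would have declared (such as &nbsp;) will not resolve.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class IgnoreDtd {

    /** Parses the given markup without fetching or reading the DTD named in its DOCTYPE. */
    static Document parseIgnoringDtd(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Hand the parser an empty stream for every external entity, including the DTD.
        builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String html = "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" "
                + "\"http://www.w3.org/TR/html4/loose.dtd\">"
                + "<html><body><p>Here is a text!</p></body></html>";
        Document doc = parseIgnoringDtd(html);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```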
