
Is there a Java equivalent of the Python function lxml.sax.saxify [1], which generates SAX events from a DOM and fires them against a SAX ContentHandler? My main goal is to convert a DOM object into a list of paragraphs. Given this HTML snippet:

<p> Here is a text! 
<ul><li>list1</li><li>list2</li></ul>
</p>

The output I want is:

  • 1st paragraph: Here is a text!
  • 2nd paragraph: list1
  • 3rd paragraph: list2

[1] http://lxml.de/api/lxml.sax-module.html#saxify

Anand S Kumar
Daisy

2 Answers


Yes. You can run an identity transformation from a DOMSource to a SAXResult, which walks the DOM and fires the corresponding SAX events at your handler; see http://www.java2s.com/Code/Java/XML/GeneratingSAXParsingEventsbyTraversingaDOMDocument.htm:

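// Wrap the existing DOM document (doc) as the transformation source.
Source source = new DOMSource(doc);

// Optional: give the source a system ID, used to resolve relative URIs.
URI uri = new File("infilename.xml").toURI();
source.setSystemId(uri.toString());

// MyHandler is your own DefaultHandler subclass; it receives the SAX events.
DefaultHandler handler = new MyHandler();
SAXResult result = new SAXResult(handler);
// A Transformer created without a stylesheet performs an identity transformation.
Transformer xformer = TransformerFactory.newInstance().newTransformer();
xformer.transform(source, result);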

But why don't you extract the information you want from your DOM itself?
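
To make the approach above concrete for the paragraph use case, here is a minimal sketch. The class name DomToSaxParagraphs and the ParagraphHandler class are hypothetical names introduced for illustration, and the sketch assumes the input is well-formed XML (the snippet from the question is, but real-world HTML often is not). The handler treats every element boundary as a paragraph break, which is one possible reading of the desired output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXResult;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class DomToSaxParagraphs {

    /** Hypothetical handler: each non-blank run of text between tags becomes one paragraph. */
    static class ParagraphHandler extends DefaultHandler {
        final List<String> paragraphs = new ArrayList<>();
        private final StringBuilder buffer = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            flush(); // an opening tag ends the text collected so far
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            flush(); // so does a closing tag
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so buffer it until the next tag.
            buffer.append(ch, start, length);
        }

        private void flush() {
            String text = buffer.toString().trim();
            if (!text.isEmpty()) {
                paragraphs.add(text);
            }
            buffer.setLength(0);
        }
    }

    /** Fires SAX events for the given DOM at a ParagraphHandler and returns what it collected. */
    static List<String> paragraphs(Document doc) throws Exception {
        ParagraphHandler handler = new ParagraphHandler();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new SAXResult(handler));
        return handler.paragraphs;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<p> Here is a text! <ul><li>list1</li><li>list2</li></ul></p>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        for (String p : paragraphs(doc)) {
            System.out.println(p);
        }
    }
}
```

If inline elements such as <b> or <a> should not split a paragraph, the handler would need to flush only on block-level element names; that choice is left open here.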

JP Moresmau

If you want to retrieve all text nodes from a DOM document (which is a different question than the original one), then XPath is probably the easiest way to search for and extract data from a DOM document.

Here is the piece of code you need:

Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse("/path/example.html");
XPath xPath = XPathFactory.newInstance().newXPath();
String pattern = "//*/text()"; // retrieve all text nodes in the doc
NodeList nl = (NodeList) xPath.compile(pattern)
        .evaluate(doc, XPathConstants.NODESET);
for (int i = 0; i < nl.getLength(); i++) {
    Node n = nl.item(i);
    String text = n.getNodeValue();
    // skip over null or whitespace-only text (check for null before trimming)
    if (text != null && !text.trim().isEmpty()) {
        System.out.println(text.trim());
    }
}
Sharon Ben Asher
  • Thanks for your reply. I tried your code but got this error: `[Fatal Error] loose.dtd:31:3: The declaration for the entity "HTML.Version" must end with '>'`. Do you know whether this is because I am parsing HTML rather than XML? – Daisy Jul 07 '15 at 12:11
  • A quick search suggests that this is indeed caused by the SGML DTD. This thread suggests ways to work around it: http://stackoverflow.com/questions/155101/make-documentbuilder-parse-ignore-dtd-references – Sharon Ben Asher Jul 07 '15 at 13:00
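
One way to work around the DTD error discussed in the comments is to install an EntityResolver that resolves every external entity, including the DTD, to empty content. The sketch below is an illustration under assumptions, not a definitive fix: the class name IgnoreDtdExample and the method newBuilderIgnoringDtd are hypothetical, and the page must still be well-formed XML. Because the DTD is skipped, named HTML entities such as &nbsp; would no longer be declared and would cause a parse error.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class IgnoreDtdExample {

    /** Builds a parser that resolves every external entity (including the DTD) to empty content. */
    static DocumentBuilder newBuilderIgnoringDtd() throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Returning an empty InputSource stops the parser from fetching loose.dtd.
        builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
        return builder;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input: a well-formed page that references the HTML 4.01 loose DTD.
        String page = "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" "
                + "\"http://www.w3.org/TR/html4/loose.dtd\">"
                + "<html><body><p>Here is a text!</p></body></html>";
        Document doc = newBuilderIgnoringDtd().parse(new InputSource(new StringReader(page)));
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

The resulting Document can then be passed to the XPath code from the answer. For HTML that is not well-formed, an HTML-aware parser would be needed instead of DocumentBuilder.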