cyberneko html settings to ignore unencoded greater than and less than symbol

Question

I have HTML content that contains greater-than and less-than symbols, but they are not encoded as &lt; and &gt;. To balance the tags in the content, I pass it through the CyberNeko HTML parser. After parsing, the content between those greater-than and less-than symbols is discarded. What settings do I have to configure in the CyberNeko HTML parser to overcome this problem?

sample content:

<div>Average Response Time server is critical because its value 282 > 0 ms. <br>[Threshold Details : Critical if value > 0, Warning if value = 0, Clear if value < 0]</div>

After nekohtml parsing

<div><br> 0]</div>

Please help. Thanks in advance

score 1 · Answer 1 · answered Jun 23 '11 at 13:23

The program below will output

<div>Average Response Time server is critical because its value 282 > 0 ms. <br/>[Threshold Details : Critical if value > 0, Warning if value = 0, Clear if value < 0]</div>

package test;

import java.io.StringReader;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.apache.xerces.dom.DocumentImpl;
import org.cyberneko.html.parsers.DOMFragmentParser;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.xml.sax.InputSource;

public class TestHTMLDOMFragment {
    private static final String PARSE_TEXT = "<div>Average Response Time server is critical because its value 282 > 0 ms. <br>[Threshold Details : Critical if value > 0, Warning if value = 0, Clear if value < 0]</div>";

    public static void main(String[] argv) throws Exception {
        DOMFragmentParser parser = new DOMFragmentParser();

        // output the elements in lowercase, nekohtml doesn't do this by default
        parser.setProperty("http://cyberneko.org/html/properties/names/elems","lower");

        // if this is set to true (the default for DOMFragmentParser, so you don't need to specify it)
        // then nekohtml won't add html, head and body tags to the result.
        parser.setFeature("http://cyberneko.org/html/features/document-fragment",true);

        Document document = new DocumentImpl();
        DocumentFragment fragment = document.createDocumentFragment();

        // parse the document into a fragment
        parser.parse(new InputSource(new StringReader(PARSE_TEXT)), fragment);

        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        // don't output the XML declaration (<?xml ...?>)
        transformer.setOutputProperty("omit-xml-declaration", "yes");
        DOMSource source = new DOMSource(fragment);
        StreamResult result = new StreamResult(System.out);
        transformer.transform(source, result);

    }
}

The comments in the code above show the parser settings I've used.

I've also used org.cyberneko.html.parsers.DOMFragmentParser, since you may be parsing text that is just an HTML fragment rather than a complete document.

I'm using nekohtml 1.9.14

If you use maven, here's the pom.xml dependencies section...

<dependencies>
    <dependency>
        <groupId>net.sourceforge.nekohtml</groupId>
        <artifactId>nekohtml</artifactId>
        <version>1.9.14</version>
        <type>jar</type>
    </dependency>
</dependencies>
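
The serialization half of the program above (the Transformer with omit-xml-declaration) does not depend on nekohtml, so it can be tried on its own. Here is a minimal sketch, assuming the JDK's built-in DOM and transformer implementations instead of org.apache.xerces.dom.DocumentImpl; the class name FragmentSerializationSketch and the method serializeFragment are made up for illustration. It builds a fragment by hand with a text node holding raw > and < characters and serializes it to a string.

```java
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;

// Hypothetical helper class, not part of nekohtml or of the answer's program.
public class FragmentSerializationSketch {

    // Wraps the given text in a <div> inside a DocumentFragment and serializes it.
    static String serializeFragment(String text) throws Exception {
        // Assumption: the JDK's default DocumentBuilder is used to create the owner document.
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        DocumentFragment fragment = document.createDocumentFragment();

        // The text node stores the characters as-is; no escaping happens in the DOM.
        Element div = document.createElement("div");
        div.appendChild(document.createTextNode(text));
        fragment.appendChild(div);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Same setting as "omit-xml-declaration" in the program above.
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(fragment), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serializeFragment("Critical if value > 0, Clear if value < 0"));
    }
}
```

Since the result is written by an XML serializer, a < inside a text node comes out escaped as &lt; in the serialized string. The symbol is still present in the DOM; only its written form changes.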
