I am using the NekoHTML framework with Xerces 2.11.0 to parse an HTML document, but I am having a problem with this simple code:

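import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

DOMParser parser = new DOMParser();
System.out.println(parser.getClass().toString());
// The InputSource is given a system ID (the page URL) for the parser to fetch
InputSource url = new InputSource("http://www.cbgarden.org");
try {
    parser.parse(url);
    Document document = parser.getDocument();
    System.out.println(document.hasChildNodes());
    System.out.println(document.getBaseURI());
    System.out.println(document.getNodeName());
    // In the DOM API, getNodeValue() is defined to return null for a Document node
    System.out.println(document.getNodeValue());
} catch (Exception e) {
    e.printStackTrace();
}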

Here is the output of the print statements:

  1. class org.cyberneko.html.parsers.DOMParser
  2. true
  3. http://www.cbgarden.org
  4. document
  5. null

So my question is: what could be wrong? No exception is thrown, and I am following the usage rules given for NekoHTML. The libraries on my build path are in this order:

  1. nekohtml.jar
  2. nekohtmlSamples.jar
  3. xercesImpl.jar
  4. xercesSamples.jar
  5. xml-apis.jar
cwallenpoole
tt0686
  • I just have one more question on this subject: why does parser.getDocument() return a document with two nodes, one of which is null? – tt0686 Oct 11 '11 at 16:54

1 Answer

I guess your question is about the null?
The document node has no value. It only has child nodes (such as <html>, which contains <head> and <body>).
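
To illustrate the point, here is a minimal sketch that inspects a document node and its children. It uses the JDK's built-in parser on a small well-formed string instead of NekoHTML, so it runs without extra jars; the class name `DocumentNodeDemo` and the `SAMPLE` markup are made up for the example, and NekoHTML may report names differently (for instance in upper case). It also suggests a likely explanation for the "two nodes, one of them null" observation, assuming the page declares a DOCTYPE: the doctype is a child of the document too, and `getLocalName()` returns null for anything that is not an element or attribute.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DocumentNodeDemo {

    // Hypothetical sample page, kept well-formed so the JDK XML parser accepts it.
    static final String SAMPLE =
        "<!DOCTYPE html><html><head><title>t</title></head><body><p>hi</p></body></html>";

    // Parses markup into a DOM tree with the JDK's built-in parser (not NekoHTML).
    static Document parse(String markup) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Namespace awareness is needed for getLocalName() to work on elements.
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                      .parse(new InputSource(new StringReader(markup)));
    }

    public static void main(String[] args) throws Exception {
        Document document = parse(SAMPLE);

        // The Document node itself never carries a value.
        System.out.println("document value: " + document.getNodeValue());

        // Its children are where the content lives: here a doctype and the root element.
        NodeList children = document.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            System.out.println("type=" + child.getNodeType()
                + " name=" + child.getNodeName()
                + " localName=" + child.getLocalName());
        }

        // The root element is the usual starting point for walking the page.
        System.out.println("root: " + document.getDocumentElement().getNodeName());
    }
}
```

With a DOM produced by NekoHTML, the same calls (`getChildNodes()`, `getNodeType()`, `getDocumentElement()`) should apply, since they belong to the standard `org.w3c.dom` interfaces.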

But if you want the whole page source as a String, you can simply download it by calling the openStream() method on a URL.
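
A rough sketch of that approach follows. The class name `PageSource` and the helper `fetch` are invented for the example, and decoding as UTF-8 is an assumption; a real page may declare a different charset in its headers or markup.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PageSource {

    // Reads everything the URL serves and decodes it with the given charset.
    static String fetch(URL url, Charset charset) throws IOException {
        try (InputStream in = url.openStream()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int read;
            // Copy the stream chunk by chunk until it is exhausted.
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            return new String(buffer.toByteArray(), charset);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java PageSource <url>");
            return;
        }
        // Assumption: the page is UTF-8 encoded.
        URL url = URI.create(args[0]).toURL();
        System.out.println(fetch(url, StandardCharsets.UTF_8));
    }
}
```

Running it with `http://www.cbgarden.org` as the argument would print the raw markup of the page from the question, without building a DOM at all.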

Martijn Courteaux
  • Yes, I am seeing this now. If I call document.getChildNodes(), the result is two nodes: one of them returns "HTML" from getLocalName() and the other returns null. How can I see the whole document? If I use document.toString(), it returns [document: null] – tt0686 Oct 11 '11 at 16:38
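
On the last point in that comment: `toString()` on a DOM node only prints a short label, not the markup. One way to see the whole tree as text is to serialize it with the JDK's `javax.xml.transform` API. The sketch below is only an illustration under that assumption; the class name `DomToString` and the sample markup are hypothetical, and the document is built with the JDK parser rather than NekoHTML so the example stays self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomToString {

    // Serializes any DOM node (including a whole Document) back to markup.
    static String serialize(Node node) throws Exception {
        // An identity transformer copies the source tree to the result unchanged.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    // Builds a small document to serialize; stands in for parser.getDocument().
    static Document sample() throws Exception {
        String markup = "<html><body><p>hi</p></body></html>";
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize(sample()));
    }
}
```

Since `serialize` accepts any `org.w3c.dom.Node`, passing it the result of `parser.getDocument()` from the question should work the same way.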