Normalization in DOM parsing with Java - how does it work?

Question

I saw the line below in code for a DOM parser at this tutorial.

doc.getDocumentElement().normalize();

Why do we do this normalization?
I read the docs but I could not understand a word.

Puts all Text nodes in the full depth of the sub-tree underneath this Node

Okay, then can someone show me (preferably with a picture) what this tree looks like?

Can anyone explain to me why normalization is needed?
What happens if we don't normalize?

Irrespective of your question, please read the note on the example: _"DOM Parser is slow and will consume a lot of memory when it loads an XML document which contains a lot of data. Please consider SAX parser as solution for it, SAX is faster than DOM and use less memory."_. – wulfgarpro Dec 09 '12 at 10:27
@wulfgar.pro - I understand what you said. But I want to understand the stuff I asked in the question. I will also do SAX parsing soon. – Apple Grinder Dec 09 '12 at 11:01
Searching Google for "normalize xml" gave some results that seem useful. It looks like it's similar to normalization in databases. – Apple Grinder Dec 09 '12 at 11:37
You'll never understand it if you only read the first third of each sentence. Try reading the *entire* sentence you quoted. The meaning is as plain as a pike staff. – user207421 Dec 09 '12 at 11:42
@EJP - umm... it's still not clear because I don't know XML in depth and I only read a few introductory pages on it. BTW, don't get me wrong, you did exactly what the author of the doc did - using complex words instead of plain English (plain as a pike staff = easy to understand). Simple words first and jargon later works better for me. – Apple Grinder Dec 09 '12 at 12:03
@AppleGrinder There are no 'complex words' in my comment. The sentence I referred to is easy to understand *if you read it all.* The evidence of your quotation shows that you didn't. Don't blame me for that, and don't blame the authors for it either. – user207421 Dec 09 '12 at 20:06
As of this writing the referenced website is referencing this SO post. My brain just threw a dependency error. – chessofnerd Jul 25 '13 at 21:22

score 385 · Accepted Answer · edited May 14 '13 at 09:31

The rest of the sentence is:

where only structure (e.g., elements, comments, processing instructions, CDATA sections, and entity references) separates Text nodes, i.e., there are neither adjacent Text nodes nor empty Text nodes.

This basically means that the following XML element

<foo>Hello 
wor
ld</foo>

could be represented like this in a denormalized node:

Element foo
    Text node: ""
    Text node: "Hello "
    Text node: "wor"
    Text node: "ld"

When normalized, the node will look like this:

Element foo
    Text node: "Hello world"

And the same goes for attributes: <foo bar="Hello world"/>, comments, etc.
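
To see the effect outside of a parser, here is a minimal sketch that builds the denormalized tree from above by hand and then normalizes it. The class name `NormalizeDemo` and the helper `buildDenormalizedFoo` are made up for illustration, and the fragments are created manually because a real parser may or may not split text this way.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class NormalizeDemo {
    // Hypothetical helper: builds <foo> with four separate Text children,
    // mimicking the fragmented tree shown in the answer.
    static Element buildDenormalizedFoo() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element foo = doc.createElement("foo");
            doc.appendChild(foo);
            foo.appendChild(doc.createTextNode(""));
            foo.appendChild(doc.createTextNode("Hello "));
            foo.appendChild(doc.createTextNode("wor"));
            foo.appendChild(doc.createTextNode("ld"));
            return foo;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element foo = buildDenormalizedFoo();
        System.out.println(foo.getChildNodes().getLength()); // 4

        // Merges adjacent Text nodes and removes empty ones.
        foo.normalize();

        System.out.println(foo.getChildNodes().getLength());    // 1
        System.out.println(foo.getFirstChild().getNodeValue()); // Hello world
    }
}
```

Without the call to `normalize()`, code that reads only `getFirstChild().getNodeValue()` would see the empty string here instead of the full text, which is the usual practical reason for normalizing right after parsing.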

edited May 14 '13 at 09:31

Alex Spurling

answered Dec 09 '12 at 13:07

JB Nizet

Aha! It's much clearer now. I don't know about data structures (???) and nodes. But I had a quick look at tree structures and I am guessing that a computer might store "hello world" in the way you suggested. Is that right? – Apple Grinder Dec 09 '12 at 13:12
You need to learn the basics about DOM. Yes, DOM represents an XML document as a tree. And in a tree, you have a root node having child node, each child node also having child nodes, etc. That's what a tree is. Element is a kind of node, and TextNode is another kind of node. – JB Nizet Dec 09 '12 at 13:20
Thanks JB Nizet. Can't tell you how relieved I am after getting some direction. – Apple Grinder Dec 09 '12 at 13:26
I think your example shouldn't contain newlines: `Hello world` – user2043553 Jun 27 '14 at 08:10
@user2043553, the newlines are actually the point there. Without newlines, you wouldn't see the difference. In case that wasn't clear: Normalization "corrects" the XML so one tag is interpreted as one element. If you didn't do that, these very newlines might be interpreted as delimiters between several elements of the same type (or in the same tag). – Stacky Oct 23 '14 at 15:59
@Stacky, in the example there are two newlines; they are not displayed after normalizing, which might make people believe they are not there anymore. The resulting text node with newlines displayed would look like: "Hello\nwor\nld". Normalizing does not remove newlines. – Christian Mar 22 '15 at 19:09
Why is there `Text node: ""` in denormalized node? – Malwinder Singh Jun 04 '15 at 14:38
@M.S. Not necessarily. The parser is free to split the text into as many text nodes as it wants. – JB Nizet Jun 04 '15 at 14:41
@JBNizet: is there any way I can ensure that the tree built is the same in both cases? Please check http://stackoverflow.com/questions/30940162/dom-parser-wrong-childnodes-count#comment49916498_30940162 – user3930361 Jun 19 '15 at 16:09

AVA · Answer 2 · 2018-07-26T15:07:58.243

In simple terms, normalisation is the reduction of redundancies.
Examples of redundancies:
a) whitespace outside the root/document tags (...<document></document>...)
b) whitespace within a start tag (<...>) and an end tag (</...>)
c) whitespace between attributes and their values (i.e. spaces between the key name and =")
d) superfluous namespace declarations
e) line breaks/whitespace in the text of attributes and tags
f) comments, etc.
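
For comparison with the list above, here is a small sketch (the class name `WhitespaceDemo` and the helper `parseRoot` are made up for illustration) that parses an element containing padded text and a comment, then calls `normalize()` on it. As far as the DOM API documentation goes, `Node.normalize()` only merges adjacent Text nodes and drops empty ones, so the whitespace and the comment should still be there afterwards; the list reads more like XML normalization in a broader sense.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class WhitespaceDemo {
    // Hypothetical helper: parses an XML string with the default
    // (non-coalescing, comment-preserving) factory settings.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element foo = parseRoot("<foo> a <!-- c --> b </foo>");
        foo.normalize();

        // Text " a ", the Comment, and Text " b " are all still present.
        System.out.println(foo.getChildNodes().getLength());                // 3
        System.out.println("[" + foo.getFirstChild().getNodeValue() + "]"); // [ a ]
    }
}
```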

score 7 · Answer 3 · answered Jun 18 '15 at 06:39

As an extension to @JBNizet's answer, for more technical users: here is what the implementation of the org.w3c.dom.Node interface in com.sun.org.apache.xerces.internal.dom.ParentNode looks like. It should give you an idea of how it actually works.

public void normalize() {
    // No need to normalize if already normalized.
    if (isNormalized()) {
        return;
    }
    if (needsSyncChildren()) {
        synchronizeChildren();
    }
    ChildNode kid;
    for (kid = firstChild; kid != null; kid = kid.nextSibling) {
        kid.normalize();
    }
    isNormalized(true);
}

It traverses all the child nodes recursively, calling kid.normalize() on each one.
This mechanism is overridden in org.apache.xerces.dom.ElementImpl:

public void normalize() {
     // No need to normalize if already normalized.
     if (isNormalized()) {
         return;
     }
     if (needsSyncChildren()) {
         synchronizeChildren();
     }
     ChildNode kid, next;
     for (kid = firstChild; kid != null; kid = next) {
         next = kid.nextSibling;

         // If kid is a text node, we need to check for one of two
         // conditions:
         //   1) There is an adjacent text node
         //   2) There is no adjacent text node, but kid is
         //      an empty text node.
         if ( kid.getNodeType() == Node.TEXT_NODE )
         {
             // If an adjacent text node, merge it with kid
             if ( next!=null && next.getNodeType() == Node.TEXT_NODE )
             {
                 ((Text)kid).appendData(next.getNodeValue());
                 removeChild( next );
                 next = kid; // Don't advance; there might be another.
             }
             else
             {
                 // If kid is empty, remove it
                 if ( kid.getNodeValue() == null || kid.getNodeValue().length() == 0 ) {
                     removeChild( kid );
                 }
             }
         }

         // Otherwise it might be an Element, which is handled recursively
         else if (kid.getNodeType() == Node.ELEMENT_NODE) {
             kid.normalize();
         }
     }

     // We must also normalize all of the attributes
     if ( attributes!=null )
     {
         for( int i=0; i<attributes.getLength(); ++i )
         {
             Node attr = attributes.item(i);
             attr.normalize();
         }
     }

     // changed() will have occurred when the removeChild() was done,
     // so does not have to be reissued.

     isNormalized(true);
 }

Hope this saves you some time.
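
If you want to experiment with the merging logic without digging into Xerces internals, the same idea can be sketched against the public org.w3c.dom API. This is a simplified illustration, not the Xerces implementation: the class `SimpleNormalizer` and its methods are hypothetical names, it ignores attributes and the isNormalized flag, and it only exists to make the loop above runnable on its own.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;

class SimpleNormalizer {
    // Merges runs of adjacent Text children, removes empty Text children,
    // and recurses into child elements.
    static void mergeTextNodes(Node parent) {
        Node kid = parent.getFirstChild();
        while (kid != null) {
            Node next = kid.getNextSibling();
            if (kid.getNodeType() == Node.TEXT_NODE) {
                // Absorb every Text sibling that directly follows kid.
                while (next != null && next.getNodeType() == Node.TEXT_NODE) {
                    ((Text) kid).appendData(next.getNodeValue());
                    Node after = next.getNextSibling();
                    parent.removeChild(next);
                    next = after;
                }
                String value = kid.getNodeValue();
                if (value == null || value.isEmpty()) {
                    parent.removeChild(kid);
                }
            } else if (kid.getNodeType() == Node.ELEMENT_NODE) {
                mergeTextNodes(kid);
            }
            kid = next;
        }
    }

    // Builds <foo>"" "Hello " "world" <bar>""</bar></foo>, merges, and
    // reports: foo child count | first child text | bar child count.
    static String demo() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element foo = doc.createElement("foo");
            doc.appendChild(foo);
            foo.appendChild(doc.createTextNode(""));
            foo.appendChild(doc.createTextNode("Hello "));
            foo.appendChild(doc.createTextNode("world"));
            Element bar = doc.createElement("bar");
            bar.appendChild(doc.createTextNode(""));
            foo.appendChild(bar);
            mergeTextNodes(foo);
            return foo.getChildNodes().getLength() + "|"
                    + foo.getFirstChild().getNodeValue() + "|"
                    + bar.getChildNodes().getLength();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 2|Hello world|0
    }
}
```

In real code you would simply call `normalize()`; the sketch is only meant to make the two cases in the Xerces comments (adjacent text node, empty text node) easier to step through.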
