9

How does one tell the XML parser to honor leading and trailing whitespace?

Dim xml: Set xml = CreateObject("MSXML2.DOMDocument")
xml.async = False
xml.loadxml "<xml>1 2</xml>"
wscript.echo len(xml.documentelement.text)

Above prints out 3.

Dim xml: Set xml = CreateObject("MSXML2.DOMDocument")
xml.async = False
xml.loadxml "<xml> 2</xml>"
wscript.echo len(xml.documentelement.text)

Above prints out 1. (I'd like it to print 2).

Is there something special I can put in the xml document itself to tell the parser to keep leading and trailing whitespace in the document?

CLARIFICATION 1: Is there an attribute that can be specified ONCE at the beginning of the document to apply to all elements?

CLARIFICATION 2: Because the element contents may contain Unicode data but the XML file needs to be plain ASCII, everything is entity-encoded, which means CDATA sections are unfortunately not available.

Michael Haren
Michael Pryor
  • CDATA certainly is available. You might just have to use more than one per element value, though. – Rob Kennedy Jan 05 '09 at 21:58
  • @michaelpryor: About all answers recommending "xml:space". This problem has nothing to do with xml:space, which controls how a parser treats whitespace-*only* nodes. The nodes shown are definitely not whitespace-only. See my solution, which is the only one that really treats the problem. Cheers, – Dimitre Novatchev Jan 06 '09 at 04:53
  • The problem has *nothing* to do with CDATA. CDATA is only there at parsing time, in the infoset, it is no longer present and whitespaces *are* part of the infoset. – bortzmeyer Jan 06 '09 at 10:05

2 Answers

8

As I commented, all answers recommending xml:space="preserve" are wrong.

The xml:space attribute can only be used to control the treatment of whitespace-only nodes, that is, text nodes composed entirely of whitespace characters.

This is not at all the case with the current problem.
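To make the distinction concrete, here is a minimal C sketch of what "whitespace-only" means. It assumes the XML 1.0 definition of white space (space, tab, carriage return, line feed); the helper names `is_xml_space` and `is_whitespace_only` are hypothetical and do not belong to any parser API.

```c
#include <stdbool.h>
#include <stddef.h>

/* XML 1.0 white space (production S): space, tab, carriage return, line feed. */
static bool is_xml_space(char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

/*
 * Hypothetical helper: returns true when a non-empty text value consists
 * only of XML white space characters, false otherwise (including for
 * NULL and for the empty string).
 */
bool is_whitespace_only(const char *text)
{
    size_t i;

    if (text == NULL || text[0] == '\0')
        return false;
    for (i = 0; text[i] != '\0'; i++) {
        if (!is_xml_space(text[i]))
            return false;
    }
    return true;
}
```

For the strings in the question, `is_whitespace_only(" 2")` and `is_whitespace_only("1 2")` are both false, so neither text node falls into the whitespace-only category described above. A value such as `" "` would.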

In fact, the code provided below correctly obtains a length of 2 for the text node contained in:

<xml> 2</xml>

Here is the VB code that correctly gets the length of the text node (do not forget to add a reference to "Microsoft XML, v 3.0"):

Dim xml As MSXML2.DOMDocument

Private Sub Form_Load()
    Set xml = CreateObject("MSXML2.DOMDocument")
    xml.async = False
    xml.loadXML "<xml> 2</xml>"
    Dim n As Long
    ' Read the text node's nodeValue, not the element's .text property:
    ' nodeValue keeps the leading space.
    n = Len(xml.documentElement.selectSingleNode("text()").nodeValue)
    ' Print n itself. Len(n) would give the length of the number, not of the text.
    Debug.Print n
End Sub

If you put a breakpoint on the line:

Debug.Print n

you'll see that when the debugger breaks there, the value of n is 2, as required.

Therefore, this code is the solution that was being sought.

Dimitre Novatchev
  • the xml:space="preserve" attribute worked though. I don't know who deleted the answers that suggested it, but that worked fine for me. – Michael Pryor Jan 09 '09 at 20:25
  • @michaelpryor: More accurately, the answer to the original question is: "No, nothing special needs to be put in the XML document, as the parser does not trim any non-whitespace text node. Simply use the "nodeValue" property and do not use the "text" property." – Dimitre Novatchev Jan 10 '09 at 04:59
  • On the line which assigns value to `n`, should we have `Len`? – Jaywalker Jun 29 '12 at 09:07
  • @Jaywalker: Yes, we *do* have `Len` that returns the length of the string. – Dimitre Novatchev Jun 29 '12 at 12:12
4

As mentioned by Dimitre Novatchev, an XML parser does not delete whitespace at will. The whitespace is part of the node's value. Since I do not speak Visual Basic, here is a C program using libxml that prints the length of each text node directly under the root element. There is absolutely no need to set xml:space.

% ./whitespace "<foo> </foo>"
Length of " " is 1

% ./whitespace "<foo> 2</foo>"
Length of " 2" is 2

% ./whitespace "<foo>1 2</foo>" 
Length of "1 2" is 3

Here is the program:

#include <stdio.h>
#include <string.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

int
main(int argc, char **argv)
{
    char           *xml;
    xmlDoc         *doc;
    xmlNode        *root, *node;
    if (argc < 2) {
        fprintf(stderr, "Usage: %s XML-string\n", argv[0]);
        return 1;
    }
    xml = argv[1];
    doc = xmlReadMemory(xml, (int) strlen(xml), "my data", NULL, 0);
    if (doc == NULL) {          /* Not well-formed: nothing to inspect */
        fprintf(stderr, "Cannot parse \"%s\"\n", xml);
        return 1;
    }
    root = xmlDocGetRootElement(doc);
    /* Walk the children of the root element and report every text node */
    for (node = root ? root->children : NULL; node; node = node->next) {
        if (node->type == XML_TEXT_NODE) {
            fprintf(stdout, "Length of \"%s\" is %zu\n", (char *) node->content,
                    strlen((char *) node->content));
        }
    }
    xmlFreeDoc(doc);
    xmlCleanupParser();
    return 0;
}
bortzmeyer