Differentiating between an empty text node and a text node with whitespace

Question

While validating an XML file, I want to log any text node with empty content. A newline \n is also considered a text node, but those are not the ones I want to flag. In the XML below, 'parent' has two text nodes with content '\n' that are not interesting to me. The content of 'elem1' is '\n\n', which is an error and must be reported. 'elem2' has valid content. The content of 'book' is empty and must be reported.

In my first try I searched each text node for [\n\t\r] and ignored the ones that matched. But this way I also ignore elem1, which should have been reported as an error.

What am I doing wrong? (Note: I have to solve this issue without XSD validation.)

Update 1: I have added more \n between the elements. Now the first 'parent' node has 5 text nodes with content \n.

<root>

    <parent>

        <elem1>

        </elem1> 

        <elem2>good content of el2</elem2>

        <elem3> half so good
               contentof el3</elem3>
    </parent>
    
    <parent>
        <elem1>
        </elem1> 

        <elem2>good content</elem2>
        <elem3>good</elem3>

        <elem4></elem4>

    </parent>

    <book></book>
    

</root>

Update 2, for clarity: when a caller calls, say, validate("//parent/*"), I gather all nodes matching the given path and get a node list back. Then I start the validation for each node and its children.

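// Evaluate the XPath expression relative to currentNode and collect the matches.
NodeList result = (NodeList) xpathinstance.evaluate(path, currentNode, XPathConstants.NODESET);

// Validate every matched node (and, inside the method, its children).
for (int n = 0; n < result.getLength(); n++) {
    validateThereAreNoGaps(result.item(n));
}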

When I arrive at the first 'parent' element, it shows 7 children (after the update of the example). Each \n between the element tags is considered a text node.

As a next solution I am now trying to replace all \n with "" to get rid of them...
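
For illustration, here is a minimal sketch of one way the check described above could be written with the standard DOM and XPath APIs. It is not the code from the question: the class name BlankLeafCheck and the methods isBlankLeaf and findBlankLeaves are hypothetical, and it assumes that "empty" means an element with no child elements whose text is empty or whitespace only. Whitespace between the children of 'parent' is never inspected on its own, so it is not reported.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class BlankLeafCheck {

    // Hypothetical helper: true if the element has no child elements and
    // its text content is empty or consists only of whitespace.
    public static boolean isBlankLeaf(Element element) {
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                // Whitespace text nodes here are only layout between child tags.
                return false;
            }
        }
        return element.getTextContent().trim().isEmpty();
    }

    // Hypothetical helper: names of the elements selected by `path`
    // that are blank leaves, in the order the node list returns them.
    public static List<String> findBlankLeaves(String xml, String path) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList result = (NodeList) xpath.evaluate(path, doc, XPathConstants.NODESET);

        List<String> blank = new ArrayList<>();
        for (int n = 0; n < result.getLength(); n++) {
            Node node = result.item(n);
            // Skip anything that is not an element (e.g. if the path selects text nodes).
            if (node instanceof Element && isBlankLeaf((Element) node)) {
                blank.add(node.getNodeName());
            }
        }
        return blank;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>\n  <parent>\n\n    <elem1>\n\n    </elem1>\n\n"
                + "    <elem2>good content</elem2>\n    <elem4></elem4>\n  </parent>\n"
                + "  <book></book>\n</root>";
        System.out.println(findBlankLeaves(xml, "//parent/*"));
    }
}
```

The sketch only looks at the elements selected by the path, not at their descendants; a recursive variant would call isBlankLeaf on each child element as well.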

If I read you correctly you want to find all text nodes that contain only whitespace, but ignore all of those that contain exactly 1 newline, right? — Joachim Sauer, Jun 30 '20 at 10:17
@Joachim Sauer Not exactly one. The author of the XML file might have inserted several newlines between the elements for readability. — saab, Jun 30 '20 at 10:40
@Mandy8055 using a parser would expect a model definition for the document, which I never use in my validation code. — saab, Jun 30 '20 at 10:46
But you say that `elem1` is an error, and it is exactly that: two newlines. Why is that an error when adding whitespace for readability is OK? — Joachim Sauer, Jun 30 '20 at 11:56
@Joachim the rule I am validating says: nodes in path //parent/* are expected not to be empty but filled at least with a 'xyz'. — saab, Jun 30 '20 at 12:01
In my answer I have addressed that, @saab. Please have a look, and drop a comment if you need something else. — , Jun 30 '20 at 13:15
@Mandy8055 I search for all nodes with the given path and get a node list back. Then I validate each node in the list and its children for their contents. I have nodes instead of strings, so I cannot apply the regular expression you have suggested. I will update my description. — saab, Jun 30 '20 at 13:46
@Mandy8055 Finally I found a way. What do you think about this: the text nodes inside elem1, elem2, and elem3 have neither a next sibling nor a previous sibling. But a text node inside the 'parent' (e.g. \n ) has a next sibling which is . The last text node of 'parent' (which is also \n ) has only a previous sibling but no next sibling. — saab, Jun 30 '20 at 18:57
Is the XML format always going to be the same? If so, your approach is worth appreciating =) — , Jun 30 '20 at 19:25
Unfortunately it was not a correct solution. I added another elem4 which is empty and should be logged as an error. But it is being ignored for \n before and after the . I must have misunderstood the meaning of sibling... — saab, Jun 30 '20 at 23:25

Johnson · Answer 1 · 2020-06-30T11:14:11.533

0

Here's a short expression that might help you:

<(\w+?>)[^\S]*<\/\1

This will select any element whose content is empty or consists only of whitespace ([^\S] is equivalent to \s).

If you don't want to select the tags, just use this:

<(?<=(\w+?>))[^\S]*(?=<\/\1)

However, this second one cannot identify:

<books></books>

for example, but in that case I suggest simply using:

><

as your expression to find those separately.
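
To try the first expression from Java, a sketch along the following lines might do. It assumes the XML is available as a plain string (not as DOM nodes); the class name EmptyElementRegex and the method findEmptyElements are made up for this example. Note that the pattern expects ">" directly after the tag name, so elements with attributes are not matched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyElementRegex {

    // The first expression above, escaped for a Java string literal.
    // Group 1 captures the tag name together with the closing '>'.
    static final Pattern EMPTY_ELEMENT = Pattern.compile("<(\\w+?>)[^\\S]*<\\/\\1");

    // Hypothetical helper: names of all elements whose content is
    // empty or whitespace only, in order of appearance.
    public static List<String> findEmptyElements(String xml) {
        List<String> names = new ArrayList<>();
        Matcher matcher = EMPTY_ELEMENT.matcher(xml);
        while (matcher.find()) {
            String tag = matcher.group(1);                   // e.g. "elem1>"
            names.add(tag.substring(0, tag.length() - 1));   // drop the trailing '>'
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<parent>\n  <elem1>\n\n  </elem1>\n"
                + "  <elem2>good content</elem2>\n  <elem4></elem4>\n</parent>";
        System.out.println(findEmptyElements(xml));
    }
}
```

With this sample input, elem1 and elem4 are reported, while 'parent' is not, because its whitespace is followed by an opening tag rather than by its own closing tag.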

edited Jun 30 '20 at 11:14

answered Jun 30 '20 at 10:56
