4

I have a list of elements with information about how deep they are located in an XML tree. The elements at "the bottom," i.e. those elements that occur before an element with a lower depth, contain text.

<input>
    <text n="x" xml:id="a" depth="1"/>
    <div xml:id="b" depth="2"/>
    <div xml:id="c" depth="3"/>
    <p xml:id="e" depth="4">text</p>
    <p xml:id="d" depth="4">text</p>
    <p xml:id="x" depth="4">text</p>
    <div xml:id="f" depth="3"/>
    <lg xml:id="j" depth="4"/>
    <l xml:id="k" depth="5">text</l>
    <l xml:id="l" depth="5">text</l>
    <p xml:id="n" depth="3">text</p>
</input>

I would like to reconstitute this as the XML tree below, in one operation.

<text n="x" xml:id="a" depth="1">
    <div xml:id="b" depth="2">
        <div xml:id="c" depth="3">
            <p xml:id="e" depth="4">text</p>
            <p xml:id="d" depth="4">text</p>
            <p xml:id="x" depth="4">text</p>
        </div>
        <div xml:id="f" depth="3">
            <lg xml:id="j" depth="4">
                <l xml:id="k" depth="5">text</l>
                <l xml:id="l" depth="5">text</l>
            </lg>
        </div>
        <p xml:id="n" depth="3">text</p>
    </div>
</text>

I think I need to start with the elements of the highest depth, i.e. with all elements of depth 5, and then wrap them up in the preceding element of depth 5-1, and so on, but I can't get my head around how to recurse through this.

The @xml:ids are just for reference.

My question is the converse of my earlier stackoverflow question. It resembles this stackoverflow question, but I need to use XQuery.

  • Frankly this'll be a lot cleaner in a simple DOM or SAX application than in XQuery. Personally, I think you should go back to whoever is passing you this ugly depth-annotated thing in the first place and hit them over the head with the XML Recommendation until they give you properly structured data. – keshlam Feb 03 '14 at 14:12
  • I agree with your opinion on the input; but XQuery is quite suitable to do this. – Jens Erat Feb 03 '14 at 14:52
  • Let me then explain what I use this for. In TEI, one can distinguish between block-level elements (that can only contain element contents or whose parent can only contain element contents) and inline elements (that can contain text/mixed contents). A problem has been how to deal with variation in the order and presence of block-level elements. If one first flattens the tree, one is able to address and thus move around and delete/add block-level elements. With Jens Erat's help, I managed to do the first and write an for block-level elements, but how to put Humpty-Dumpty together again? – Jens Østergaard Petersen Feb 03 '14 at 15:30
  • See also https://stackoverflow.com/questions/11395990/how-to-generate-an-xml-file-from-a-set-of-xpath-expressions for a similar question. – Jens Østergaard Petersen Nov 14 '14 at 07:06

3 Answers3

3

Build a function that recursively builds the tree. This code is very generic, by changing the local:getLevel($node) function it should work for arbitrary "flattened" trees.

declare function local:getLevel($node as element()) as xs:integer {
  $node/@depth
};

declare function local:buildTree($nodes as element()*) as element()* {
  let $level := local:getLevel($nodes[1])
  (: Process all nodes of current level :)
  for $node in $nodes
  where $level eq local:getLevel($node)

  (: Find next node of current level, if available :)
  let $next := ($node/following-sibling::*[local:getLevel(.) le $level])[1]
  (: All nodes between the current node and the next node on same level are children :)
  let $children := $node/following-sibling::*[$node << . and (not($next) or . << $next)]

  return
    element { name($node) } {
      (: Copy node attributes :)
      $node/@*,
      (: Copy all other subnodes, including text, pi, elements, comments :)
      $node/node(),

      (: If there are children, recursively build the subtree :)
      if ($children)
      then local:buildTree($children)
      else ()
    }
};

let $xml := document {
  <input>
    <text n="x" xml:id="a" depth="1"/>
    <div xml:id="b" depth="2"/>
    <div xml:id="c" depth="3"/>
    <p xml:id="e" depth="4">text</p>
    <p xml:id="d" depth="4">text</p>
    <p xml:id="x" depth="4">text</p>
    <div xml:id="f" depth="3"/>
    <lg xml:id="j" depth="4"/>
    <l xml:id="k" depth="5">text</l>
    <l xml:id="l" depth="5">text</l>
    <p xml:id="n" depth="3">text</p>
  </input>
}

return local:buildTree($xml/input/*)

Hereby I release this code to the public domain.

If your XQuery processor does not support enhanced FLWOR expressions, you need to reorder some of the lines; I omitted the comments:

  for $node in $nodes
  let $level := local:getLevel($nodes[1])
  let $next := ($node/following-sibling::*[local:getLevel(.) le $level])[1]
  let $children := $node/following-sibling::*[$node << . and (not($next) or . << $next)]
  where $level eq local:getLevel($node)
Jens Erat
  • 37,523
  • 16
  • 80
  • 96
  • Thanks again, Jens - I will be sure to credit you for both your postings! – Jens Østergaard Petersen Feb 03 '14 at 15:31
  • This one is beautiful. – Jens Østergaard Petersen Feb 03 '14 at 15:39
  • I was going to put the two solutions into a cookbook of XQuery solutions for comparisionhttp://kitwallace.co.uk/Book/set/maketree but I couldnt get this one to work? - it doesn't compile (for/where is followed by let) and if I insert an extra return, it only yields the top node [on eXist-db 2.0] – Chris Wallace Feb 03 '14 at 17:25
  • Just switch the two statements. I wrote that code using BaseX which supports enhanced FLWOR expressions and this order seems both more intuitive and will be faster if not optimized by the XQuery engine anyway: the level test is really cheap (O(1)), whereas finding the children is in O(n), so better filter before searching them. – Jens Erat Feb 03 '14 at 18:08
  • Sorry Jens I dont quite follow what you are suggesting -wouldn't it be better to post valid XQuery anyway? – Chris Wallace Feb 03 '14 at 18:20
  • 1
    I added it as an alternative to the end of my question. – Jens Erat Feb 03 '14 at 19:24
  • The problem (at least for XQuery on eXist, I havent tried it elsewhere) , is the // in the chidren expression. The script works correctly when // is replaced by /. The return can go after the where (I presume thats what the BaseX processor is doing ) . I've put both into my Cookbook http://kitwallace.co.uk/Book/set/maketree/execute and actually the script using intersect is somewhat faster than the script using the order operator, though this is only tested on eXist of course. – Chris Wallace Feb 03 '14 at 23:21
  • That one definitely shouldn't be there. Just queries lots of nodes that get excluded anyway. Thank you for noticing. – Jens Erat Feb 04 '14 at 06:45
  • Just to explain: Jens Erat's code originally had `let $children := $node//following-sibling::*[$node << . and (not($next) or . << $next)]`, but the `//` has now been corrected to `/`. – Jens Østergaard Petersen Feb 04 '14 at 07:12
  • You can try this solution live at http://try.zorba.io/queries/xquery/SR7HkqXkPfe4VAIb5k0%2BMhxE3ug%3D – wcandillon Feb 07 '14 at 13:36
2

Just to propose another approach - I dont think I've used intersect in anger before!

declare function local:buildTree($nodes,$level)  {
  for $node in $nodes[@depth=$level]
  let $end := $node/following-sibling::*[@depth = $level][1]
  let $rest := 
       if ($end) 
       then $node/following-sibling::*  intersect  $end/preceding-sibling::*
       else $node/following-sibling::*
  return 
    element {$node/name()} {
        $node/@*,
        $node/node(),
        local:buildTree($rest,$level+1)
    }
};
declare function local:buildTree($node) {
       local:buildTree($node/*,1)
};
let $xml := document {
  <input>
    <text n="x" xml:id="a" depth="1"/>
    <div xml:id="b" depth="2"/>
    <div xml:id="c" depth="3"/>
    <p xml:id="e" depth="4">text</p>
    <p xml:id="d" depth="4">text</p>
    <p xml:id="x" depth="4">text</p>
    <div xml:id="f" depth="3"/>
    <lg xml:id="j" depth="4"/>
    <l xml:id="k" depth="5">text</l>
    <l xml:id="l" depth="5">text</l>
    <p xml:id="n" depth="3">text</p>
  </input>
}

return local:buildTree($xml/input)
Chris Wallace
  • 495
  • 3
  • 11
  • 1
    This is definitely more elegant and high-level, but might require that both sets get calculated first. I guess (without any proof) that the node order operator will run faster. – Jens Erat Feb 03 '14 at 18:10
  • 1
    Actually in eXist, intesect is faster see http://kitwallace.co.uk/Book/set/maketree/execute – Chris Wallace Feb 04 '14 at 07:51
  • 1
    Will have a look into that when I've got more spare time... There are a few other differences, too. Though this analysis would be more interesting with larger documents of real-world size. – Jens Erat Feb 04 '14 at 08:26
  • Thank you for that solution. One can also use XQuery 3.0's `tumbling window` to simplify it a little. I've posted the version with `tumbling window` as a new answer. – favq Jul 20 '14 at 10:48
2

Here is another version, based on Chris Wallace's approach, but using XQuery 3.0's tumbling window construct, which makes this code slightly simpler.

declare function local:buildTree($nodes,$level)  {
  for tumbling window $node-window in $nodes
  start $start when $start/@depth = $level
  let $rest := fn:tail($node-window)
  return 
    element {$start/fn:name()} {
        $start/@*,
        $start/node(),
        local:buildTree($rest,$level+1)
    }
};
declare function local:buildTree($node) {
       local:buildTree($node/*,1)
};
let $xml := document {
  <input>
    <text n="x" xml:id="a" depth="1"/>
    <div xml:id="b" depth="2"/>
    <div xml:id="c" depth="3"/>
    <p xml:id="e" depth="4">text</p>
    <p xml:id="d" depth="4">text</p>
    <p xml:id="x" depth="4">text</p>
    <div xml:id="f" depth="3"/>
    <lg xml:id="j" depth="4"/>
    <l xml:id="k" depth="5">text</l>
    <l xml:id="l" depth="5">text</l>
    <p xml:id="n" depth="3">text</p>
  </input>
}

return local:buildTree($xml/input)
favq
  • 739
  • 1
  • 11
  • 25