
I have an XML file that represents the syntax trees of all the sentences in a book:

<book>
    <sentence>
        <w class="pronoun" role="subject">
            I
        </w>
        <wg type="verb phrase">
            <w class="verb" role="verb">
                like
            </w>
            <wg type="noun phrase" role="object">
                <w class="adj">
                    green
                </w>
                <w class="noun">
                    eggs
                </w>
            </wg>
        </wg>
    </sentence>
    <sentence>
        ...
    </sentence>
    ...
</book>

This example is fake, but the point is that the actual words (the <w> elements) are nested in unpredictable ways based on syntactic relationships.

What I'm trying to do is find <sentence> nodes with <w> descendants matching particular criteria in a certain order. For example, I may be looking for a sentence with a w[@class='pronoun'] descendant followed by a w[@class='verb'] descendant.

It's easy to find sentences that just contain both descendants, without caring about ordering:

//sentence[descendant::w[criteria1] and descendant::w[criteria2]]

I did manage to figure out this query that does what I want, which looks for a <w> with a following <w> matching the criteria with the same closest <sentence> ancestor:

for $sentence in //sentence
where $sentence[descendant::w[criteria1 and 
    following::w[(ancestor::sentence[1] = $sentence) and criteria2]]]
return ...

...but unfortunately it's very slow, and I'm not sure why.

Is there a non-slow way to search for a node that contains descendants matching criteria in a certain order? I'm using XQuery 3.1 with BaseX. If I can't find a reasonable way to do this with XQuery, plan B is to do post-processing with Python.

Christian Grün
sheesania
  • Could you please try to provide an XML example document that is well-formed? Next, there are some inconsistencies in your comment; for example, the `w` element does not exist in your XML data (I assume it’s supposed to be `word`). – Christian Grün Jan 15 '20 at 21:35
  • @ChristianGrün Sorry about that. I updated the XML example to be more accurate. – sheesania Jan 15 '20 at 21:43
  • One reason it's slow is that in the expression `ancestor::sentence[1] = $sentence`, you probably should have used `is` rather than `=`. Comparing nodes by identity is likely to be much faster than comparing their string values, especially when they have many descendants. – Michael Kay Jan 15 '20 at 23:21
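
The suggestion in the last comment can be sketched roughly as follows. This is the question's query with `=` swapped for `is`; the `@class` tests are assumptions standing in for criteria1 and criteria2. It still walks the `following` axis, so it may remain slow on large documents.

```xquery
(: Sketch only: same shape as the query in the question, but the
   sentence check uses node identity (is) instead of comparing
   string values (=). The @class tests are illustrative placeholders
   for criteria1 and criteria2. :)
for $sentence in //sentence
where $sentence/descendant::w[
        @class = 'pronoun' and
        following::w[ancestor::sentence[1] is $sentence and @class = 'verb']
      ]
return $sentence
```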

1 Answer


The following axis is indeed expensive, as it spans all subsequent nodes in the document that are neither descendants nor ancestors of the context node.

The node comparison operators (<<, >>, is) may help you here. The code example below checks whether there is at least one verb that is followed by a noun:

for $sentence in //sentence
let $words1 := $sentence//w[@class = 'verb']
let $words2 := $sentence//w[@class = 'noun']
where some $w1 in $words1 satisfies 
      some $w2 in $words2 satisfies $w1 << $w2
return $sentence
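
If only the existence of such a pair matters, the nested `some` check can probably be reduced to a single comparison: some verb precedes some noun exactly when the first verb precedes the last noun in document order. The sketch below assumes the same `@class` criteria as above; the variable names are made up for illustration.

```xquery
(: Sketch: path expressions return nodes in document order, so [1]
   is the earliest match and [last()] the latest one. :)
for $sentence in //sentence
let $first-verb := ($sentence//w[@class = 'verb'])[1]
let $last-noun  := ($sentence//w[@class = 'noun'])[last()]
(: << yields the empty sequence if either operand is empty,
   which the where clause treats as false. :)
where $first-verb << $last-noun
return $sentence
```

This avoids comparing every verb with every noun, though whether it is measurably faster will depend on the data and on how the processor optimizes the original query.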
Christian Grün
  • Yup, that is working great! Thanks a lot :) For anyone else looking at this in the future, to grab the words that match, I'm using: `let $matching_words := for $word1 in $sentence//w[criteria] for $word2 in $sentence//w[criteria] where $word1 << $word2 return ($word1, $word2)` and then checking `where $matching_words` – sheesania Jan 15 '20 at 22:08
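
For reference, the approach from the comment above can be written out as a self-contained query. This is a sketch: the inline `<book>` is a trimmed copy of the sample from the question, the pronoun/verb criteria and the `<match>` wrapper element are assumptions chosen for illustration.

```xquery
(: Self-contained sketch; the inline data is a made-up sample
   modelled on the XML in the question. :)
let $book :=
  <book>
    <sentence>
      <w class="pronoun" role="subject">I</w>
      <wg type="verb phrase">
        <w class="verb" role="verb">like</w>
        <wg type="noun phrase" role="object">
          <w class="adj">green</w>
          <w class="noun">eggs</w>
        </wg>
      </wg>
    </sentence>
  </book>
for $sentence in $book//sentence
(: Collect every (pronoun, verb) pair in which the pronoun comes
   first in document order. :)
let $matching-words :=
  for $word1 in $sentence//w[@class = 'pronoun']
  for $word2 in $sentence//w[@class = 'verb']
  where $word1 << $word2
  return ($word1, $word2)
where exists($matching-words)
return <match>{ $matching-words ! normalize-space(.) }</match>
```

For the sample data this should return a single `<match>` element holding the text of the pronoun and the verb; a sentence without such a pair is skipped.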