I have an XML file that represents the syntax trees of all the sentences in a book:
<book>
<sentence>
<w class="pronoun" role="subject">
I
</w>
<wg type="verb phrase">
<w class="verb" role="verb">
like
</w>
<wg type="noun phrase" role="object">
<w class="adj">
green
</w>
<w class="noun">
eggs
</w>
</wg>
</wg>
</sentence>
<sentence>
...
</sentence>
...
</book>
This example is fake, but the point is that the actual words (the <w> elements) are nested in unpredictable ways based on syntactic relationships.
What I'm trying to do is find <sentence> nodes with <w> descendants matching particular criteria in a certain order. For example, I may be looking for a sentence with a w[@class='pronoun'] descendant followed by a w[@class='verb'] descendant.
It's easy to find sentences that just contain both descendants, without caring about ordering:
//sentence[descendant::w[criteria1] and descendant::w[criteria2]]
I did manage to figure out a query that does what I want. It looks for a <w> matching the first criteria that has a following <w> matching the second criteria, where that following <w> has the same closest <sentence> ancestor:
for $sentence in //sentence
where $sentence[descendant::w[criteria1 and
following::w[(ancestor::sentence[1] = $sentence) and criteria2]]]
return ...
...but unfortunately it's very slow, and I'm not sure why.
Is there a non-slow way to search for a node that contains descendants matching criteria in a certain order? I'm using XQuery 3.1 with BaseX. If I can't find a reasonable way to do this with XQuery, plan B is to do post-processing with Python.
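For reference, here is a minimal sketch of what plan B might look like, assuming the standard library's xml.etree.ElementTree is enough for the file. The names has_ordered_words, find_sentences, is_pronoun and is_verb are hypothetical, and the second sentence in SAMPLE is made up purely to have a non-matching case. It relies on Element.iter() yielding descendants in document order, so a single pass per sentence is enough.

```python
import xml.etree.ElementTree as ET

# Small stand-in for the real file; the second sentence is invented for illustration.
SAMPLE = """<book>
  <sentence>
    <w class="pronoun" role="subject">I</w>
    <wg type="verb phrase">
      <w class="verb" role="verb">like</w>
      <wg type="noun phrase" role="object">
        <w class="adj">green</w>
        <w class="noun">eggs</w>
      </wg>
    </wg>
  </sentence>
  <sentence>
    <wg type="verb phrase">
      <w class="verb" role="verb">run</w>
    </wg>
    <w class="pronoun" role="subject">you</w>
  </sentence>
</book>"""


def has_ordered_words(sentence, first, second):
    """Return True if a <w> matching `first` is later followed,
    in document order, by a different <w> matching `second`."""
    seen_first = False
    # iter("w") walks all descendant <w> elements in document order,
    # regardless of how deeply they are nested inside <wg> groups.
    for w in sentence.iter("w"):
        # Check `second` before updating the flag so one element
        # cannot satisfy both criteria at once.
        if seen_first and second(w):
            return True
        if first(w):
            seen_first = True
    return False


def find_sentences(root, first, second):
    """Collect every <sentence> whose words match `first` then `second`."""
    return [s for s in root.iter("sentence")
            if has_ordered_words(s, first, second)]


# Hypothetical stand-ins for criteria1 and criteria2.
def is_pronoun(w):
    return w.get("class") == "pronoun"


def is_verb(w):
    return w.get("class") == "verb"


root = ET.fromstring(SAMPLE)
matches = find_sentences(root, is_pronoun, is_verb)
for sentence in matches:
    print(" ".join(w.text.strip() for w in sentence.iter("w")))
```

Swapping the two predicates, as in find_sentences(root, is_verb, is_pronoun), selects the other sentence instead, which is a quick way to check that ordering is actually being enforced. I'd still prefer to keep this in XQuery if there is a reasonable way.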