
I have the following XML:

<root>
  <tok lemma="per" xpos="SPS00">per</tok>
  <tok lemma="els" xpos="L3CP0">los</tok>
  <tok lemma="qual" xpos="PR0CP000">quals</tok>
  <tok lemma="ser" xpos="VSIP3P0">són</tok>
  <tok lemma="digne" xpos="AQ0CP00">dignes</tok>
  <tok lemma="de" xpos="SPC00">de</tok>
  <tok lemma="gloriós" xpos="AQ0FS00">gloriosa</tok>
  <tok lemma="memòria" xpos="NCFS000">memòria</tok>
  <tok xpos="CC" lemma="i">e</tok>
  <tok lemma="li" xpos="L3CSD" >li</tok>
  <tok lemma="plàcia" xpos="VMSP3S0">plàcia</tok>
  <tok lemma="molt" xpos="RG">molt</tok>
  <tok lemma="per" xpos="SPS00">per</tok>
</root>

I'm trying to use this XPath:

//tok[starts-with(@xpos, "L")]/preceding-sibling::tok[1][not(starts-with(@xpos, "V"))]/following-sibling::tok[1][not(starts-with(@xpos, "V"))]

to capture only the middle element in this sequence of elements:

  <tok lemma="per" xpos="SPS00">per</tok>
  <tok lemma="els" xpos="L3CP0">los</tok>
  <tok lemma="qual" xpos="PR0CP000">quals</tok>

My thinking was that this XPath requires the condition to hold for both the preceding and the following element. I'm obviously wrong, because there is a match whenever the condition holds for the preceding sibling, even if it doesn't hold for the following one. Right now my XPath yields:

<tok lemma="els" xpos="L3CP0">los</tok>
<tok lemma="li" xpos="L3CSD">li</tok>

My goal is for the second one to be excluded, because the element that follows it has an 'xpos' attribute whose value starts with 'V' (you can see that in the sample XML).

What am I doing wrong? I thought I had gotten the gist of XPath syntax by now. How does one specify in XPath that the condition on the attribute value has to be met both by the element immediately preceding the match and by the one immediately following it?

Michael M.
jfontana

1 Answer


An expression that matches your conditions would be:

//tok[starts-with(@xpos, "L") 
      and preceding-sibling::tok[1][not(starts-with(@xpos, "V"))] 
      and following-sibling::tok[1][not(starts-with(@xpos, "V"))]]

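Your approach won't work because each "/" step moves to a new node, so your XPath means:

  • find a tok element whose xpos attribute starts with "L"
  • move to its immediately preceding sibling, keeping it only if its xpos doesn't start with "V"
  • move to that sibling's immediately following sibling, which is the same node found in step 1, keeping it only if its xpos doesn't start with "V" (it never will, since we already know it starts with "L")

The element that follows the "L" token is therefore never tested. Predicates joined with "and" check both neighbours without moving away from the "L" token.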
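
The difference between the two expressions can be traced by hand in plain Python. This is only a sketch: it emulates the location steps with the standard library's xml.etree.ElementTree rather than running an XPath engine, the helper names (xpos, chained, predicate) are made up for illustration, and it assumes every child of root is a tok element, as in the sample from the question.

```python
import xml.etree.ElementTree as ET

# Sample document from the question.
XML = """<root>
  <tok lemma="per" xpos="SPS00">per</tok>
  <tok lemma="els" xpos="L3CP0">los</tok>
  <tok lemma="qual" xpos="PR0CP000">quals</tok>
  <tok lemma="ser" xpos="VSIP3P0">són</tok>
  <tok lemma="digne" xpos="AQ0CP00">dignes</tok>
  <tok lemma="de" xpos="SPC00">de</tok>
  <tok lemma="gloriós" xpos="AQ0FS00">gloriosa</tok>
  <tok lemma="memòria" xpos="NCFS000">memòria</tok>
  <tok xpos="CC" lemma="i">e</tok>
  <tok lemma="li" xpos="L3CSD" >li</tok>
  <tok lemma="plàcia" xpos="VMSP3S0">plàcia</tok>
  <tok lemma="molt" xpos="RG">molt</tok>
  <tok lemma="per" xpos="SPS00">per</tok>
</root>"""

# Assumption: all children of <root> are <tok> elements, in document order.
toks = list(ET.fromstring(XML))


def xpos(tok):
    """Return the xpos attribute, or '' if it is missing."""
    return tok.get("xpos", "")


def chained(toks):
    """Emulate the original path, where every '/' moves to another node."""
    selected = []
    for i, tok in enumerate(toks):
        if not xpos(tok).startswith("L"):          # step 1: the "L" token
            continue
        if i == 0 or xpos(toks[i - 1]).startswith("V"):
            continue                               # step 2: now at toks[i - 1]
        j = (i - 1) + 1                            # step 3: its next sibling...
        if xpos(toks[j]).startswith("V"):          # ...which is toks[i] again
            continue
        selected.append(toks[j])
    return selected


def predicate(toks):
    """Emulate the corrected path: stay on the "L" token, test both sides."""
    selected = []
    for i, tok in enumerate(toks):
        if not xpos(tok).startswith("L"):
            continue
        if i == 0 or i == len(toks) - 1:           # a neighbour is missing
            continue
        if xpos(toks[i - 1]).startswith("V") or xpos(toks[i + 1]).startswith("V"):
            continue
        selected.append(tok)
    return selected


print([t.text for t in chained(toks)])    # ['los', 'li']
print([t.text for t in predicate(toks)])  # ['los']
```

In chained, the index j always ends up equal to i, which is why the third step can never reject anything.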
JaSON
  • Thanks very much, @JaSON! This reveals an important misunderstanding I had about XPath syntax. What I'm wondering now is whether there is a more efficient way to state this XPath expression. Once I had verified that your solution works on some small XML documents, I started using it in production. When the file to be processed is around 30MB, it takes forever to process. I'm using the LXML library, which is supposed to be the fastest in Python. 30MB is not all that big a file, so I don't understand why it is so slow. – jfontana Dec 27 '22 at 11:28
  • @jfontana have you checked [this](https://stackoverflow.com/questions/44880509/speeding-up-the-xml-parsing-process-using-lxml-and-xpath#answer-44880933) or something from [here](https://stackoverflow.com/questions/2365208/speeding-up-xpath)? – JaSON Dec 27 '22 at 14:06
  • 1
    Thanks again @JaSON. I had checked solutions like the ones in your first link but those don't work (using cElementTree is not faster than doing it via the default lxml.etree). In the second link I found a reference of SAX which looks promising. I had tried a while ago but I had to give it up because it was too complicated for me to integrate it into what I already have (which relies a lot on LXML). I found info on LXML support for SAX, though. So I'm going to look into it. – jfontana Dec 27 '22 at 14:59
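
One alternative to the sibling-axis XPath, offered only as a sketch and not benchmarked on a 30MB file, is a single linear pass over each parent's tok children in Python. It assumes the slowdown comes from evaluating the sibling axes once per candidate node, which the thread does not confirm. The function name find_matches and its parameters are hypothetical. The example uses the standard library's xml.etree.ElementTree on an excerpt of the question's sample. lxml elements offer the same iter, findall, and get methods, so the function is expected to carry over, though that is untested here.

```python
import xml.etree.ElementTree as ET


def find_matches(root, target="L", blocked="V"):
    """Collect <tok> elements whose xpos starts with `target` and whose
    nearest <tok> siblings on both sides exist and do not start with `blocked`.
    """
    matches = []
    for parent in root.iter():            # visit every element once
        toks = parent.findall("tok")      # its direct <tok> children, in order
        tags = [t.get("xpos", "") for t in toks]
        # Only positions with a neighbour on each side can qualify.
        for i in range(1, len(toks) - 1):
            if (tags[i].startswith(target)
                    and not tags[i - 1].startswith(blocked)
                    and not tags[i + 1].startswith(blocked)):
                matches.append(toks[i])
    return matches


# Excerpt of the sample document from the question.
SAMPLE = """<root>
  <tok lemma="per" xpos="SPS00">per</tok>
  <tok lemma="els" xpos="L3CP0">los</tok>
  <tok lemma="qual" xpos="PR0CP000">quals</tok>
  <tok xpos="CC" lemma="i">e</tok>
  <tok lemma="li" xpos="L3CSD">li</tok>
  <tok lemma="plàcia" xpos="VMSP3S0">plàcia</tok>
</root>"""

print([t.text for t in find_matches(ET.fromstring(SAMPLE))])  # ['los']
```

Each tok is compared with its two neighbours exactly once, so the work grows linearly with the number of elements. Whether that beats the XPath engine on a given file would need to be measured.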