
I am aware that this question has been asked before, for example here: XPath select all elements between two specific elements

But there, and in a few other Google hits, the answers use hard-coded values to select specific data.

What I would like to do is get a list of the text elements for each "parent" (the divider that precedes them):

<doc>
    <divider />
    <p>text</p>
    <p>text</p>
    <p>text</p>
    <p>text</p>
    <p>text</p>
    <divider />
    <p>text</p>
    <p>text</p>
    <divider />
    <p>text</p>
    <divider />
</doc>

To get the first group of text elements you can do:

/*/p[count(preceding-sibling::divider)=1]

But what I want as output is something like this:

[['<doc>'], ['<p>text</p>', '<p>text</p>', '<p>text</p>', '<p>text</p>', '<p>text</p>'], ['<p>text</p>', '<p>text</p>'], ['<p>text</p>']]

Now you have a list of every text element for divider 1, divider 2, divider x...

That is what you get from this Python code:

matches = []
tmp = []
# Read the file line by line; a line containing 'divider' closes the current group.
with open("inputfile", "r") as data:
    for line in data:
        currentLine = line.strip()
        if 'divider' in currentLine:
            if len(tmp) > 0:
                matches.append(tmp)
                tmp = []
        else:
            tmp.append(currentLine)

print(matches)

Yes, there's a 'doc' at the beginning; it's just an example, not perfect. With this code you could also save the parent in the same list, but in the test data that is always divider, so I did not do it.

What's the XPath magic for this?

James Baker

2 Answers


XPath can't return nested sequences. In XPath 3.1 you can return an array, which does what you want. Without that, you can either (1) call XPath once to find out how many groups there are and then (2) call XPath once for each group, or use XSLT.
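
Whichever route you take without arrays, the nesting has to be built in the host language. As a minimal sketch of that idea (this is a plain loop over the children using Python's standard `xml.etree.ElementTree`, not the two-step XPath approach above), assuming the dividers and paragraphs are direct children of the root as in the question's sample. The function name `group_by_divider` is made up for illustration:

```python
import xml.etree.ElementTree as ET

# The sample document from the question.
SAMPLE = """<doc>
    <divider />
    <p>text</p>
    <p>text</p>
    <p>text</p>
    <p>text</p>
    <p>text</p>
    <divider />
    <p>text</p>
    <p>text</p>
    <divider />
    <p>text</p>
    <divider />
</doc>"""


def group_by_divider(root, divider_tag="divider"):
    """Return one list of child elements per divider-delimited group.

    Assumption: dividers and content are direct children of `root`.
    Empty groups (two adjacent dividers) are skipped, as in the
    question's line-based code.
    """
    groups = []
    current = None  # stays None until the first divider is seen
    for child in root:
        if child.tag == divider_tag:
            if current:  # close the group that just ended, if non-empty
                groups.append(current)
            current = []
        elif current is not None:
            current.append(child)
    return groups


groups = group_by_divider(ET.fromstring(SAMPLE))
# Serialize each element; strip() drops the whitespace that follows the tag.
print([[ET.tostring(el, encoding="unicode").strip() for el in g] for g in groups])
```

Elements after the last divider are only collected if another divider closes the group, which matches the "between two dividers" reading of the question.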

barefootliam

In XPath 3.1 you can use e.g.

array {
let $dividers := //divider
return
  for-each-pair($dividers, tail($dividers), function($d1, $d2) {
    array { root($d1)//*[. >> $d1 and . << $d2] }
  })
}

to return an array of arrays.
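
To make the pairing explicit: `for-each-pair($dividers, tail($dividers), ...)` walks the dividers two at a time, (d1, d2), (d2, d3), and so on, and `. >> $d1 and . << $d2` keeps the elements that lie between the two in document order. A rough Python analogue, only to illustrate the logic (it uses the standard library's ElementTree rather than an XPath 3.1 engine, and `between_dividers` is a hypothetical name):

```python
import xml.etree.ElementTree as ET


def between_dividers(root, divider_tag="divider"):
    """Group the elements lying between each pair of consecutive dividers."""
    # Every element in document order (the root comes first and can
    # never fall between two dividers).
    ordered = list(root.iter())
    # Document-order positions of all dividers, like //divider.
    positions = [i for i, el in enumerate(ordered) if el.tag == divider_tag]
    # zip(positions, positions[1:]) pairs each divider with the next one,
    # which is what for-each-pair($dividers, tail($dividers), ...) does.
    return [ordered[start + 1:end] for start, end in zip(positions, positions[1:])]


# Small made-up document: two adjacent dividers and a trailing <p>.
doc = ET.fromstring(
    "<doc><divider/><p>1</p><p>2</p><divider/><divider/><p>3</p><divider/><p>4</p></doc>"
)
texts = [[el.text for el in group] for group in between_dividers(doc)]
print(texts)
```

Unlike a loop that skips empty groups, this pairing yields an empty group for two adjacent dividers and ignores anything after the last divider, which mirrors how the XPath expression behaves.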

Online fiddle using the SaxonC HE Python package.

For Python, ElementPath also supports XPath 3.1.

Example Python code using SaxonC HE:

from saxonche import PySaxonProcessor

xpath = '''array {
  let $dividers := //divider
  return
    for-each-pair($dividers, tail($dividers), function($d1, $d2) {
      array { root($d1)//*[. >> $d1 and . << $d2] }
    })
}'''

with PySaxonProcessor(license=False) as saxon:
    xpath_processor = saxon.new_xpath_processor()

    xpath_processor.set_context(file_name='sample1.xml')

    xdm_result = xpath_processor.evaluate(xpath)

    print(xdm_result)

Example code using ElementPath:

import xml.etree.ElementTree as ET

from elementpath import select
from elementpath.xpath3 import XPath3Parser

xpath = '''array {
  let $dividers := //divider
  return
    for-each-pair($dividers, tail($dividers), function($d1, $d2) {
      array { root($d1)//*[. >> $d1 and . << $d2] }
    })
}'''

root = ET.parse('sample1.xml')

result = select(root, xpath, parser=XPath3Parser)

print(result)

Actually, since ElementPath's select already returns the result sequence as a Python list, it may be better not to construct an additional outer array in XPath and to return a sequence of arrays instead. That makes it easier to unwrap the XPath result into a nested Python list of element nodes:

xpath = '''let $dividers := //divider
  return
    for-each-pair($dividers, tail($dividers), function($d1, $d2) {
      array { root($d1)//*[. >> $d1 and . << $d2] }
    })
'''
result = select(root, xpath, parser=XPath3Parser)

result_list = [array.items() for array in result]

print(result_list)
Martin Honnen