
In the following XML file, I have encoded the structure of a text as div elements. The layout of the book containing the text (two columns per page) is encoded with empty pb (page beginning) and cb (column beginning) elements.

XML/TEI input:

<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" schematypens="http://relaxng.org/ns/structure/1.0"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
    <fileDesc>
        <titleStmt>
            <title type="main" xml:lang="en">Testfile</title>
        </titleStmt>
        <publicationStmt>
            <p>Test</p>
        </publicationStmt>
        <sourceDesc>
            <p>Testfile</p></sourceDesc>
    </fileDesc>
</teiHeader>
    
    
    <text>
        <body>
            <pb n="1r"/><fw type="header">Some header</fw>
            <cb n="a"/>
            <lb/><div n="1"><p>Line 1.1
                <lb/>Line 1.2
                <lb/>Line 1.3
                <lb/>Line 1.4
            </p></div>
            <cb n="b"/>
            <lb/><div n="2"><p>Line 2.1
                <lb/>Line 2.2
                <lb/>Line 2.3
                <lb/>Line 2.4
                <pb n="1v"/><fw type="header">Some header</fw>
                <cb n="a"/>
                <lb/>Line 1.1
                <lb/>Line 1.2
                <lb/>Line 1.3
                <lb/>Line 1.4
            </p></div>
            <cb n="b"/>
            <lb/><div n="2"><p>Line 1.1
                <lb/>Line 1.2
                <lb/>Line 1.3
                <lb/>Line 1.4
            </p></div>
        </body>
    </text>
</TEI>

What I want

Now I want to iterate through the tree using lxml.etree and XPath to select all the lb elements of one column, for instance all lb elements between <pb n="1r"/><fw type="header">Some header</fw><cb n="a"/> ... and the first <cb n="b"/> element thereafter.

What I have tried

I used the following xpath-expression for that:

//lb[preceding::pb[@n="1r"] and following::cb[@n="b"]]

However, it selects not only the elements expected, but also all other lb elements that are followed by a <cb n="b"/> element.

I have also tried to limit to the first occurrence of <cb n="b"/>, but it did not change the result:

//lb[preceding::pb[@n="1r"] and following::cb[@n="b"][1]]

I have already looked at similar questions such as "XPath select all elements between two specific elements", but the suggested answers did not work when selecting the right pb by its @n attribute.

Can someone point me in the right direction on how to select only the lb elements of a given column?

Edit: with lxml.etree, the element names in the XPath expression need a prefix bound to the TEI namespace for the expression from the accepted answer to match anything:

root.xpath('.//tei:lb[preceding::tei:pb[@n="2r"] and count(preceding::tei:cb[@n="b"]) = 0]', namespaces = {'tei':'http://www.tei-c.org/ns/1.0'})
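To illustrate the namespace point, here is a minimal sketch on a reduced copy of the sample (one page, two columns; the `SAMPLE` string and the variable names are made up for this illustration). It assumes lxml is installed and compares the unprefixed expression with the prefixed one:

```python
from lxml import etree

# Reduced, hypothetical version of the sample document: one page, two columns.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <pb n="1r"/><cb n="a"/>
    <lb/><div n="1"><p>Line 1.1<lb/>Line 1.2<lb/>Line 1.3<lb/>Line 1.4</p></div>
    <cb n="b"/>
    <lb/><div n="2"><p>Line 2.1<lb/>Line 2.2<lb/>Line 2.3<lb/>Line 2.4</p></div>
  </body></text>
</TEI>"""

# The prefix name "tei" is arbitrary; only the namespace URI has to match.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}
root = etree.fromstring(SAMPLE)

# Unprefixed names mean "element in no namespace", so nothing matches here.
unprefixed = root.xpath('//lb[preceding::pb[@n="1r"]]')

# With every element name prefixed, the lb elements of column a are found.
prefixed = root.xpath(
    '//tei:lb[preceding::tei:pb[@n="1r"] and count(preceding::tei:cb[@n="b"]) = 0]',
    namespaces=NS,
)

print(len(unprefixed), len(prefixed))
```

XPath 1.0 has no notion of a default namespace, which is why the `xmlns` declaration on the TEI root is not picked up automatically and each name test needs the prefix.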
np18

1 Answer


Could you try the following XPath expression:

//lb[preceding::pb[@n="1r"] and count(preceding::cb[@n='b']) = 0]

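The predicate count(preceding::cb[@n='b']) = 0 keeps only lb elements that have no <cb n="b"/> element before them in document order, so it excludes every lb from the first <cb n="b"/> onwards.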
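Because the count test only isolates the very first column of the document, a possible generalization (not part of the expression above, just a sketch) is to compare the nearest preceding pb and cb instead. The helper name `lbs_in_column`, the reduced `SAMPLE`, and the variable names `$page` and `$col` are assumptions made for this illustration; the sketch relies on lxml accepting XPath variables as keyword arguments and on the TEI namespace being bound to a prefix:

```python
from lxml import etree

# Reduced, hypothetical two-page sample with shortened columns.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<pb n="1r"/><cb n="a"/>
<lb/><div n="1"><p>Line 1.1<lb/>Line 1.2</p></div>
<cb n="b"/>
<lb/><div n="2"><p>Line 2.1<lb/>Line 2.2
<pb n="1v"/><cb n="a"/>
<lb/>Line 3.1<lb/>Line 3.2</p></div>
<cb n="b"/>
<lb/><div n="3"><p>Line 4.1<lb/>Line 4.2</p></div>
</body></text></TEI>"""

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# preceding:: is a reverse axis, so [1] picks the nearest pb / cb before the lb.
EXPR = (
    "//tei:lb[preceding::tei:pb[1]/@n = $page"
    " and preceding::tei:cb[1]/@n = $col]"
)


def lbs_in_column(root, page, col):
    """Return the lb elements whose nearest preceding pb and cb match."""
    return root.xpath(EXPR, namespaces=NS, page=page, col=col)


root = etree.fromstring(SAMPLE)

# Number of lb elements found for every (page, column) pair in the sample.
counts = {
    (page, col): len(lbs_in_column(root, page, col))
    for page in ("1r", "1v")
    for col in ("a", "b")
}

# The text following each lb is stored in its .tail attribute.
lines_1v_a = [(lb.tail or "").strip() for lb in lbs_in_column(root, "1v", "a")]

print(counts)
print(lines_1v_a)
```

Passing the page and column as variables avoids building the expression by string formatting when looping over pages. Note that an lb placed directly before a div, as in the sample, has no line text in its tail, because that text belongs to the following p element.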

Alexandra Dudkina
  • Yes, that works! For my specific use-case, I have to increment the count by 1 to loop through the pages. Thanks a lot! – np18 Feb 03 '22 at 07:16
  • @np18: Apparently it does not quite work. You edited the Question to clarify the Answer. Don't do that. It only makes things confusing. – mzjn Feb 03 '22 at 08:00
  • OK, sorry. Just to clarify my edit above: the XPath expression suggested by Alexandra works in general (for instance in OxygenXML), so it answers my question about the XPath. However, to use it with lxml.etree, the tei namespace has to be provided. I should have made that clearer in my question in the first place. Will try to do better next time. – np18 Feb 03 '22 at 09:05
  • Yes, I understand that. I just wanted to emphasize that it is important to separate Questions and Answers (because that is how SO works). – mzjn Feb 03 '22 at 09:52