I'm trying to parse an XML file in Python using root.findall.

My file looks like the sample below, and I'm trying to access the elements under "Level3".

Edit: @trincot has already provided a solution, but I have now added a default namespace to the sample data (xmlns="http://xyz.abc/forms"), and that is what causes the trouble. Why would adding 'xmlns=' cause the issue?

<?xml version="1.0" encoding="UTF-8"?>
<env:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://xyz.abc/forms" xmlns:abc="http://bus-message-envelope" xmlns:env="http://www.w3.org/2003/05/soap-envelope" abc:version="1-2">
    <env:Header>
        <abc:col1>col1Text</abc:col1>
        <abc:col2>col2Text</abc:col2>
        <abc:col3>col3Text</abc:col3>
    </env:Header>
    <env:Body>
        <Level1>
            <Level2 schemaVersion="1-1">
                <Level3>
                    <cell1>cell1Text</cell1>
                    <cell2>cell2Text</cell2>
                    <cell3>cell3Text</cell3>
                    <cell4>cell4Text</cell4>
                </Level3>
            </Level2>
        </Level1>
    </env:Body>
</env:Envelope>

I tried this, but it doesn't return anything:

from xml.etree import ElementTree
tree = ElementTree.parse("/tmp/test.xml")
root = tree.getroot()

for form in root.findall(".//Level3"):
    print(form.text)
    print("Inside Loop")  # Not even hitting this

Expected Output:

cell1Text
cell2Text
cell3Text
cell4Text

I was able to access the same elements with the code below. But how can I achieve this using findall?

for x in root[1][0][0][0]:
    print(x.text)

Output:

cell1Text
cell2Text
cell3Text
cell4Text

I searched Stack Overflow extensively but couldn't find an answer to this. I tried many things without success.

Kumar
  • This is out of my wheelhouse, but FWIW, you should make a [mre] with complete code. I'm wondering how you created `root` exactly. – wjandrea Aug 12 '22 at 17:49
  • Oops! I added the code showing how I created the root. – Kumar Aug 12 '22 at 18:04
  • That update changes everything. See duplicate links for how to account for namespaces in your XPath expressions. – kjhughes Aug 12 '22 at 20:30

1 Answer

In the first code snippet you access form.text, but form corresponds to the Level3 element, which has no text other than white space. The text you want to output sits in its child elements, so print(form.text) prints white space only.

The working code iterates the children of that same Level3 element:

for x in root[1][0][0][0]:
    print(x.text)

Here x is the deeper cellX element, which does have the text you expect.

To achieve this with findall do:

for x in root.findall(".//Level3/*"):
    print(x.text)

Note the extra level /* in the argument of findall, which means: any child element of Level3 elements.
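
As a minimal, self-contained sketch of the difference between the two expressions: the XML string below is a trimmed stand-in for the question's file, written without any default namespace (an assumption made for this example), and the names XML, own_text and cell_texts are illustrative only.

```python
from xml.etree import ElementTree

# Trimmed stand-in for the question's file, WITHOUT a default xmlns (assumption)
XML = """<Envelope>
    <Body>
        <Level1>
            <Level2 schemaVersion="1-1">
                <Level3>
                    <cell1>cell1Text</cell1>
                    <cell2>cell2Text</cell2>
                </Level3>
            </Level2>
        </Level1>
    </Body>
</Envelope>"""

root = ElementTree.fromstring(XML)

# Level3's own text is only the white space that precedes its first child
own_text = root.find(".//Level3").text

# The values live in the children, which ".//Level3/*" selects
cell_texts = [child.text for child in root.findall(".//Level3/*")]

print(repr(own_text))
print(cell_texts)
```

Printing repr(own_text) makes the white-space-only text visible, which a plain print would hide.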

See both the original and corrected code run on repl.it

If you didn't get any output with the first version, then please check the spelling. It looks suspicious that the element names in your XML sometimes start with a capital (like Level3) and sometimes not (like cell1). This could be a reason for not getting output. However, I loaded your code and XML as-is, and it produced the message "Inside Loop", as you can see when you follow the link above.

trincot
  • Good, +1, but how do you account for `print("Inside Loop") --> Not even hitting this`? – kjhughes Aug 12 '22 at 18:17
  • I cannot account for that, as I cannot reproduce that problem, @kjhughes. I have added a link to `repl.it` where the asker can see for themselves that the output is generated in their first version of the code. – trincot Aug 12 '22 at 18:19
  • Right, I couldn't see how that line would not have been executed either. Thanks. – kjhughes Aug 12 '22 at 18:20
  • `Level3` in the sample has a text node: `\n `. Try `print("Text: '" + form.text + "'")` – LMC Aug 12 '22 at 18:49
  • @LMC, yes, you are right -- I should say it doesn't have text other than white space. Updated. – trincot Aug 12 '22 at 18:54
  • In fact, if indenting is removed (all in one line) the OP should get at `print("Text: '" + form.text + "'")` -> `TypeError: must be str, not NoneType`. – LMC Aug 12 '22 at 18:58
  • @LMC, could be, but that would be a kind of *output* still -- while they say they get no output. I put my bet on a spelling mistake (in their actual code or XML) whereby `findall` doesn't find anything and the loop does not iterate. – trincot Aug 12 '22 at 19:02
  • @trincot: Excellent answer. However, while mocking up the sample data I found that I had missed another major culprit. When I add a namespace like xmlns="http://xyz.abc/forms", the same code stops working. I've updated the sample data in the question. I appreciate your help, and I wonder what the issue could be when the namespace is added. – Kumar Aug 12 '22 at 20:21
  • Adding a default namespace to an element effectively changes the element names of the element and all of its descendants that are not otherwise in a namespace. Namespaced-names have to be accounted for in XPath. See duplicate links for how. – kjhughes Aug 12 '22 at 20:29
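
The effect described in the last comment can be sketched as follows. This is a hedged illustration, not the duplicate's accepted solution: the XML string is a trimmed stand-in for the question's namespaced sample, and the prefix "f" in the NS mapping is an arbitrary name chosen for this example (any prefix works as long as it maps to the same URI).

```python
from xml.etree import ElementTree

# Trimmed stand-in for the question's sample, WITH the default namespace
XML = """<env:Envelope xmlns="http://xyz.abc/forms"
              xmlns:env="http://www.w3.org/2003/05/soap-envelope">
    <env:Body>
        <Level1>
            <Level2 schemaVersion="1-1">
                <Level3>
                    <cell1>cell1Text</cell1>
                    <cell2>cell2Text</cell2>
                </Level3>
            </Level2>
        </Level1>
    </env:Body>
</env:Envelope>"""

root = ElementTree.fromstring(XML)

# The parser stores unprefixed names as "{namespace-uri}local-name",
# so a plain "Level3" no longer matches anything
unqualified = root.findall(".//Level3")

# Option 1: map a prefix of our own choosing ("f" is hypothetical) to the URI
NS = {"f": "http://xyz.abc/forms"}
via_prefix = [el.text for el in root.findall(".//f:Level3/*", NS)]

# Option 2: spell out the namespace URI in braces
via_braces = [el.text for el in root.findall(".//{http://xyz.abc/forms}Level3/*")]

print(unqualified)
print(via_prefix)
print(via_braces)
```

On Python 3.8 and later, the wildcard form ".//{*}Level3/*" should also match regardless of namespace, which can be convenient when the URI is not known in advance.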