I have the following "example.xml" file:

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <tag1>
    <tag2>tag2<!-- comment = “this is the tag1 comment”--></tag2>
    <tag3>
      <tag4>tag4<!-- comment = “this is the tag4 comment”--></tag4>
    </tag3>
  </tag1>
</root>

I'd like to retrieve the comment that belongs to a specific node. For now, I'm only able to retrieve all comments from the file, using the following:

from lxml import etree

tree = etree.parse("example.xml")
comments = tree.xpath('//comment()')
print(comments)

As expected, this returns all the above comments from the file in a list:

[<!-- comment = \u201cthis is the tag1 comment\u201d-->, <!-- comment = \u201cthis is the tag4 comment\u201d-->]

However, how and where do I explicitly specify the node whose comment I want to retrieve? For example, how can I specify tag2 so that only <!-- comment = \u201cthis is the tag1 comment\u201d--> is returned?

EDIT

I have a use case where I need to iterate over each node of the XML file. If the iterator comes to a node that has more than one child with a comment, it returns all the comments of its children. For example, consider the following "example2.xml" file:

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <tag1>
    <tag2>
      <tag3>tag3<!-- comment = “this is the tag3 comment”--></tag3>
      <tag4>tag4<!-- comment = “this is the tag4 comment”--></tag4>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>tag3<!-- comment = “this is the tag3 comment”--></tag3>
      <tag4>tag4<!-- comment = “this is the tag4 comment”--></tag4>
    </tag2>
  </tag1>
</root>

If I follow the same steps as above, then when the loop reaches tag1/tag2, it returns all of the comments for tag3 and tag4.

I.e.:

from lxml import etree

tree = etree.parse("example2.xml")
comments = tree.xpath('tag1[1]/tag2//comment()')
print(comments)

returns

[<!-- comment = \u201cthis is the tag3 comment\u201d-->, <!-- comment = \u201cthis is the tag4 comment\u201d-->]

My two questions are therefore:

  1. How can I just return the comment of the direct node rather than including any of its children?
  2. As the result is returned in the form of a list, how can I retrieve the value/text of the comment from said list?
Adam
  • There is an option in Firefox to check the XPath to each node. Copying that path might help to load specific components from the DOM tree. – SebasSBM Nov 20 '19 at 16:59
  • Please ask one question at a time. Please avoid editing the question to add "follow-up questions". – mzjn Nov 20 '19 at 18:48

3 Answers

You need to specify the node:

from lxml import etree

tree = etree.parse("example.xml")
# Select only the comments that are direct children of a tag2 element
comments = tree.xpath('//tag2/comment()')
print(comments)

Output:

[<!-- comment = “this is the tag1 comment”-->]

Edit:

For your nested structure, you need to iterate over the repeating tags:

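# Assumes tree was parsed from "example2.xml"
tag2Elements = tree.xpath('//tag1/tag2')
for t2 in tag2Elements:
    # Relative path: evaluated from the current tag2 element
    t3Comment = t2.xpath('tag3/comment()')
    print(t2, t3Comment)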

Output:

<Element tag2 at 0x1066b69b0> [<!-- comment = “this is the tag3 comment”-->]
<Element tag2 at 0x1066b6960> [<!-- comment = “this is the tag3 comment”-->]
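
If you don't know in advance which tags carry comments, one possible approach is to walk every element and ask each one only for its own comment children. The sketch below is an illustration rather than a definitive solution: the inline XML is a shortened stand-in for "example2.xml", and `direct_comments` is a hypothetical helper name. The relative path `comment()` uses the child axis, so comments of nested elements are not included, and `.text` gives the comment's content as a string.

```python
from lxml import etree

# Shortened stand-in for example2.xml (assumption: ASCII comment text for brevity)
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<root>
  <tag1>
    <tag2>
      <tag3>tag3<!-- tag3 comment --></tag3>
      <tag4>tag4<!-- tag4 comment --></tag4>
    </tag2>
  </tag1>
</root>"""

root = etree.fromstring(XML)


def direct_comments(element):
    """Return the text of comments that are direct children of element."""
    # 'comment()' is relative to element and uses the child axis,
    # so comments inside nested elements are not matched.
    return [c.text.strip() for c in element.xpath('comment()')]


# Passing etree.Element as the tag filter makes iter() skip comment nodes
found = {el.tag: direct_comments(el) for el in root.iter(etree.Element)}

for tag, texts in found.items():
    print(tag, texts)
```

Here tag2 reports no comments of its own, while tag3 and tag4 each report theirs. Note that the dictionary is keyed by tag name only to keep the sketch short; with repeated tags, as in the full "example2.xml", you would want to key by the element itself or by its path.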
Maurice Meyer
  • Thank you! This works. What if the comment was after the tag? For example 'tag2' If I use the same way by specifying the node, it returns empty – Adam Nov 20 '19 at 17:06
  • I guess you need to use the parent node. – Maurice Meyer Nov 20 '19 at 17:09
  • Thanks again. I'll give that a go. But why when I specify a parent node does it return all the comments of its children? For example, specifying 'root/tag1' returns all the comments of the children (i.e both comments in the example). Is there any way to stop this from happening? – Adam Nov 20 '19 at 17:25
  • @Adam, please edit your question and paste an example. – Maurice Meyer Nov 20 '19 at 17:53
  • I have edited my question and included follow up questions. – Adam Nov 20 '19 at 18:19
  • Thanks. Your method works for this example, but it also assumes multiple things. 1) It assumes that we know which tags have comments (which won't be the case if I'm iterating through a number of files). 2) It also assumes that tag1/tag2, for example, doesn't have a comment (which it could in another use case). The reason I say this is that I'm still wondering if there is an explicit way of checking for comments on a specific node, without prior knowledge of exactly which nodes contain comments. – Adam Nov 20 '19 at 18:38
  • @Adam: You said **retrieve the comment to a specific node** in the question :) You can iterate over all tags as mentioned [there](https://stackoverflow.com/a/28415678/7216865) and check each tag for a comment – Maurice Meyer Nov 20 '19 at 18:53
  • Thank you! I'll give that a try. – Adam Nov 20 '19 at 23:08
  • One final question and then I'll create a new post with my follow up questions, but why does the result output that tag2 element contains the tag3 comment? Is this expected? Shouldn't it be that the tag3 element contains the tag3 comment? – Adam Nov 21 '19 at 08:57
  • No, that's right: there are 2 `<tag2>` elements which we are iterating over. But this depends on how you set up your XPath expressions. **You might want to read about lxml and XPath in general!** – Maurice Meyer Nov 21 '19 at 09:11
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/202784/discussion-between-adam-and-maurice-meyer). – Adam Nov 21 '19 at 09:23

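Change your XPath expression to //tag2/comment().

With only //comment(), you match comments under any tag in the document; adding tag2/ in front restricts the match to comments that are direct children of a tag2 element.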
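
A small sketch of the difference, assuming a condensed inline copy of the question's XML (the comment texts "first" and "second" are placeholders, not the original wording):

```python
from lxml import etree

# Condensed stand-in for example.xml
root = etree.fromstring(
    b"<root><tag1>"
    b"<tag2>tag2<!-- first --></tag2>"
    b"<tag3><tag4>tag4<!-- second --></tag4></tag3>"
    b"</tag1></root>"
)

# '//comment()' matches every comment anywhere in the document
all_comments = root.xpath('//comment()')

# '//tag2/comment()' matches only comments directly inside a tag2 element
tag2_only = root.xpath('//tag2/comment()')

print(len(all_comments), len(tag2_only))
# Each result is a comment node; .text holds its content as a string
print([c.text.strip() for c in tag2_only])
```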

Rúben

You can get the first comment like this:

>>> from lxml import etree
>>> with open('data.xml') as fd:
...  doc = etree.parse(fd)
...
>>> doc.xpath('/root/tag1/tag2/comment()')
[<!-- comment = “this is the tag1 comment”-->]

And for the last comment:

>>> doc.xpath('/root/tag1/tag3/tag4/comment()')
[<!-- comment = “this is the tag4 comment”-->]

...and of course you can use //tag2 or //tag4 if those elements are unique and you don't want to use the full path.
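
To illustrate why uniqueness matters, here is a rough sketch using a reduced inline document in the style of the question's second example (the XML and comment texts are made up for the illustration). When tag4 occurs more than once, the short form returns one comment per occurrence, while a full path with a positional predicate picks out a single one:

```python
from lxml import etree

# Reduced stand-in with a repeated tag1/tag2/tag4 structure
XML = b"""<root>
  <tag1><tag2><tag4>tag4<!-- first tag4 --></tag4></tag2></tag1>
  <tag1><tag2><tag4>tag4<!-- second tag4 --></tag4></tag2></tag1>
</root>"""

root = etree.fromstring(XML)

# Short form: matches the comment of every tag4 in the document
short = root.xpath('//tag4/comment()')

# Full path with [2]: only the tag4 under the second tag1
full = root.xpath('/root/tag1[2]/tag2/tag4/comment()')

print(len(short), len(full))
print(full[0].text.strip())
```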

larsks