I am trying to extract some data from a bunch of XML files. The issue is that the files do not all share exactly the same structure, so simply iterating over the children and extracting the values is difficult.

Is there a getElementByTag() method in Python for such XML documents? I have seen that such a method is available to C# and C++ users, but I couldn't find anything for Python.

Any help will be much appreciated!

rishran

1 Answer

Yes. The standard-library package xml.etree provides functions for working with XML (it is also available in Python 2).

The one you are looking for is findall, used with a path such as './/name' so that it matches elements at any depth.

For example:

import xml.etree.ElementTree as ET
tree = ET.fromstring(some_xml_data)
all_name_elements = tree.findall('.//name')

With:

In [1]: some_xml_data = "<help><person><name>dean</name></person></help>"

I get the following:

In [10]: tree.findall(".//name")
Out[10]: [<Element 'name' at 0x7ff921edd390>]
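To make the difference between searching direct children and searching at any depth concrete, here is a minimal sketch. The sample document and the variable names are made up for illustration; only findall and iter come from xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one <name> directly under the root, one nested deeper.
xml_data = "<help><name>top</name><person><name>dean</name></person></help>"
root = ET.fromstring(xml_data)

# A bare tag name matches direct children of the element only.
direct_children = root.findall("name")

# The './/' prefix matches descendants at any depth below the element.
any_depth = root.findall(".//name")

# iter() walks the whole subtree (including the element itself) in document order.
via_iter = list(root.iter("name"))

print([e.text for e in direct_children])  # ['top']
print([e.text for e in any_depth])        # ['top', 'dean']
print([e.text for e in via_iter])         # ['top', 'dean']
```

If the files differ in how deeply the elements are nested, './/name' or iter('name') is probably the closest match to a getElementsByTagName-style lookup.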
Dean Fenster
  • `findall` only searches at the children level. However, I was looking for something that goes all the way to the bottom of the tree. – rishran Jul 11 '16 at 14:29
  • If you use `findall` for the root element of the tree, it searches all subelements. You can also use it on the ElementTree object, instead of the root element, and then it also searches the root. – Dean Fenster Jul 11 '16 at 14:31
  • That does not work for me. It only searches the child level and nothing below that. Also, your syntax is incorrect in the answer you posted. Thanks! – rishran Jul 11 '16 at 14:40
  • @codepi You're right. Got it wrong. I edited with a fix. – Dean Fenster Jul 11 '16 at 15:03
  • How can I get the text of the `name` dean? – Heinz Feb 28 '19 at 15:26
  • @Heinz - By using the `text` attribute of the element. – Dean Fenster Feb 28 '19 at 17:04
  • @DeanFenster I believe the correct syntax should be ".//name" in order to get any element named "name". "*/name" will only return grandchildren of the element. – big_bad_bison Oct 29 '20 at 10:46
  • @big_bad_bison is right; the correct syntax is `*//name` or ".//name". – Daniel Mar 25 '21 at 04:12
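
As a rough illustration of the points raised in the comments (reading an element's text, and how "*/name" differs from ".//name"), here is a small sketch. The sample XML is hypothetical; `text`, `findall`, and `findtext` are part of xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: <name> appears both as a child and as a grandchild of the root.
root = ET.fromstring(
    "<help><name>top</name><person><name>dean</name></person></help>"
)

# '*/name' matches <name> elements exactly two levels down (grandchildren).
grandchildren = [e.text for e in root.findall("*/name")]

# './/name' matches <name> elements at any depth below the root.
descendants = [e.text for e in root.findall(".//name")]

# findtext() returns the text of the first match, or the default if none is found.
first_text = root.findtext(".//name")
missing_text = root.findtext(".//nickname", default="")

print(grandchildren)  # ['dean']
print(descendants)    # ['top', 'dean']
print(first_text)     # top
```

Passing a default to findtext avoids a None check when a file lacks the element, which may be convenient when the documents do not all share the same structure.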