How to parse XML and get instances of a particular node attribute?

Question

I have many rows in XML and I'm trying to get instances of a particular node attribute.

<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>

How do I access the values of the attribute foobar? In this example, I want "1" and "2".

Related: [Python xml ElementTree from a string source?](https://stackoverflow.com/q/647071/3357935) — Stevoisiak, Nov 02 '17 at 16:08

score 904 · Accepted Answer · edited Apr 09 '22 at 10:39

I suggest ElementTree. There are other compatible implementations of the same API, such as lxml, and cElementTree in the Python standard library itself; but, in this context, what they chiefly add is even more speed -- the ease of programming part depends on the API, which ElementTree defines.

First build an Element instance root from the XML, e.g. with the XML function, or by parsing a file with something like:

import xml.etree.ElementTree as ET
root = ET.parse('thefile.xml').getroot()

Or any of the many other ways shown at ElementTree. Then do something like:

for type_tag in root.findall('bar/type'):
    value = type_tag.get('foobar')
    print(value)

Output:

1
2
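
When the XML is already in memory as a string rather than in a file, the same lookup works on the element returned by ET.fromstring (equivalent to the XML function mentioned above). A minimal sketch, assuming the sample document from the question; the variable names are illustrative:

```python
import xml.etree.ElementTree as ET

xml_text = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

# fromstring returns the root Element (<foo>) directly, so no getroot() is needed
root = ET.fromstring(xml_text)

# Collect the attribute from every <type> under <bar>;
# get() returns None when the attribute is absent
values = [type_tag.get("foobar") for type_tag in root.findall("bar/type")]
print(values)
```

Collecting the values into a list instead of printing them inside the loop makes the result easier to reuse.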

edited Apr 09 '22 at 10:39 by Mateen Ulhaq · answered Dec 16 '09 at 05:21 by Alex Martelli

45

You seem to ignore xml.etree.cElementTree which comes with Python and in some aspects is faster than lxml ("lxml's iterparse() is slightly slower than the one in cET" -- e-mail from lxml author). – John Machin Dec 16 '09 at 11:37
8

ElementTree works and is included with Python. There is limited XPath support though and you can't traverse up to the parent of an element, which can slow development down (especially if you don't know this). See [python xml query get parent](http://stackoverflow.com/questions/5373902/python-xml-query-get-parent) for details. – Samuel Nov 26 '14 at 23:01
11

`lxml` adds more than speed. It provides easy access to information such as parent node, line number in the XML source, etc. that can be very useful in several scenarios. – Saheel Godhane Jan 21 '15 at 22:23
17

Seems that ElementTree has some vulnerability issues, this is a quote from the docs: `Warning The xml.etree.ElementTree module is not secure against maliciously constructed data. If you need to parse untrusted or unauthenticated data see XML vulnerabilities.` – Cristik Apr 23 '15 at 14:42
9

@Cristik This seems to be the case with most xml parsers, see the [XML vulnerabilities page](https://docs.python.org/3.4/library/xml.html#xml-vulnerabilities). – gitaarik Jun 04 '15 at 14:39
2

@paul, *neat* -- I plan to take inspiration from this (though probably within an actual Conan Doyle quote:-) in the future 3rd edition of "Python in a Nutshell" (of which as it happens I've just re-written the XML chapter to cover exclusively and thoroughly ElementTree -- just mailed that chapter's draft to the /effbot for his feedback...:-). – Alex Martelli Oct 31 '15 at 01:48
3

From the [docs](https://docs.python.org/3.5/library/xml.etree.elementtree.html#module-xml.etree.ElementTree): Changed in version 3.3: This module will use a fast implementation whenever available. The `xml.etree.cElementTree` module is deprecated. – stefanbschneider Nov 05 '16 at 10:35
!!! if you are concerned about security try defusedxml , explained why here https://stackoverflow.com/questions/1912434/how-do-i-parse-xml-in-python/45881375#45881375 – Artem Bernatskyi Aug 25 '17 at 12:29
1

Apparently this library prepends namespace to every tag. As seen at https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces @ZloySmiertniy – Dragas May 02 '19 at 13:32
1

Of all the solutions to use this one is the optimum one, without any shadow of doubt. This conclusion is elementary... element-tree... – mike rodent Oct 18 '19 at 17:28
There's an issue with lxml where if you parse very many xmls you will run out of memory. Developers say that it is a feature rather than a bug and will not be addressed. So if you want to use it for very many then you need to run it through a multi-processing pool and restrict max_children. Deleting a parsed xml does not free up the memory. – grofte Nov 18 '22 at 09:46
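
Several comments above point out that standard-library ElementTree elements carry no reference to their parent. One common workaround, sketched here with illustrative names (parent_of is not part of the library), is to build a child-to-parent dictionary once after parsing:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<foo><bar><type foobar="1"/><type foobar="2"/></bar></foo>'
)

# Walk the whole tree once and record the parent of every child element
parent_of = {child: parent for parent in root.iter() for child in parent}

first_type = root.find("bar/type")
print(parent_of[first_type].tag)             # tag of the element that contains <type>
print(parent_of[parent_of[first_type]].tag)  # one level further up
```

The root element has no entry in the dictionary, so a lookup on it raises KeyError. The map also has to be rebuilt if the tree is modified.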

score 468 · Answer 2 · edited Apr 09 '22 at 10:52

minidom is the quickest and pretty straightforward.

XML:

<data>
    <items>
        <item name="item1"></item>
        <item name="item2"></item>
        <item name="item3"></item>
        <item name="item4"></item>
    </items>
</data>

Python:

from xml.dom import minidom

dom = minidom.parse('items.xml')
elements = dom.getElementsByTagName('item')

print(f"There are {len(elements)} items:")

for element in elements:
    print(element.attributes['name'].value)

Output:

There are 4 items:
item1
item2
item3
item4
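
Note that getElementsByTagName searches the whole document. The sketch below shows one way to restrict the search to a single branch and to read an element's text as well as its attribute. The element text (Value1, Value2) and the secondSetOfItems branch are made up for illustration and are not part of the sample above:

```python
from xml.dom import minidom

xml_text = """<data>
    <items>
        <item name="item1">Value1</item>
        <item name="item2">Value2</item>
    </items>
    <secondSetOfItems>
        <item name="other">Ignored</item>
    </secondSetOfItems>
</data>"""

dom = minidom.parseString(xml_text)

# Start from the <items> element so <item> nodes in other branches are not matched
items = dom.getElementsByTagName("items")[0]

# getAttribute reads the attribute; firstChild is the text node inside <item>
pairs = [
    (item.getAttribute("name"), item.firstChild.nodeValue)
    for item in items.getElementsByTagName("item")
]
print(pairs)
```

firstChild is None for an empty element such as <item name="x"></item>, so real code should check for that before reading nodeValue.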

edited Apr 09 '22 at 10:52 by Mateen Ulhaq · answered Dec 16 '09 at 05:30 by Ryan Christensen

11

How do you get the value of "item1"? For example: <item name="item1">Value1</item> – swmcdonnell Feb 13 '13 at 14:03
where is the documentation for `minidom` ? I only found this but that doesn't do: http://docs.python.org/2/library/xml.dom.minidom.html – amphibient Jan 14 '14 at 20:43
1

I am also confused why it finds `item` straight from the top level of the document? wouldn't it be cleaner if you supplied it the path (`data->items`)? because, what if you also had `data->secondSetOfItems` that also had nodes named `item` and you wanted to list only one of the two sets of `item`? – amphibient Jan 14 '14 at 20:49
1

please see http://stackoverflow.com/questions/21124018/specific-pathing-for-finding-xml-elements-using-minidom-in-python – amphibient Jan 14 '14 at 21:05
The syntax won't work here you need to remove parenthesis `for s in itemlist: print(s.attributes['name'].value)` – Alex Borsody Apr 05 '17 at 05:00
The doc of this library begins with a very worrisome text: `Warning The xml.dom.minidom module is not secure against maliciously constructed data. If you need to parse untrusted or unauthenticated data see XML vulnerabilities.` – jlh Nov 20 '17 at 14:49
Big recommendation for minidom. It's perfect for developers coming from other languages who know DOM through things like SimpleXML, `System.Xml`, or even just HTML DOM. – MiffTheFox Sep 21 '20 at 02:31
Use `for s in itemlist: print(s.childNodes[0].data)` to get data from This is a item ` – baponkar Jun 11 '21 at 11:17
To get the value of "item1" you do this- `element.firstChild.nodeValue` – spoonsearch Aug 25 '22 at 11:30

score 270 · Answer 3 · edited Dec 12 '19 at 23:18

You can use BeautifulSoup:

from bs4 import BeautifulSoup

x="""<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

y = BeautifulSoup(x, "html.parser")
>>> y.foo.bar.type["foobar"]
'1'

>>> y.foo.bar.find_all("type")
[<type foobar="1"></type>, <type foobar="2"></type>]

>>> y.foo.bar.find_all("type")[0]["foobar"]
'1'
>>> y.foo.bar.find_all("type")[1]["foobar"]
'2'

edited Dec 12 '19 at 23:18 by the Tin Man · answered Dec 16 '09 at 05:12 by YOU

55

three years later with bs4 this is a great solution, very flexible, especially if the source is not well formed – cedbeu Mar 19 '13 at 09:40
12

@YOU `BeautifulStoneSoup` is DEPRECIATED. Just use `BeautifulSoup(source_xml, features="xml")` – andilabs Jul 24 '16 at 11:21
8

Another 3 years later, I just tried to load XML using `ElementTree`, unfortunately it is unable to parse unless I adjust the source at places but `BeautifulSoup` worked just right away without any changes! – ViKiG Dec 22 '16 at 07:16
12

@andi You mean "deprecated." "Depreciated" means it decreased in value, usually due to age or wear and tear from normal use. – jpmc26 Sep 28 '17 at 19:17
ElementTree and minidom choked on valid XML data saying there was invalid XML data, and BeautifulSoup was able to process it just fine. +1 – leetNightshade Nov 21 '17 at 00:42
I just had to do `import BeautifulSoup`, there was no bs4 to pull from. Idk if that's because I installed it with pip or what. – leetNightshade Nov 21 '17 at 01:10
BeautifulSoup is not so easy to set up. There may be issues with the parser. For such an easy task I suggest using xmltodict - see bottom answer. – Alexey Antonenko May 18 '18 at 09:14
Another 3 more years later (after @ViKiG comment) and 10 years later after the answer provided, BeautifulSoup working perfectly. – Sunil Kumar Nov 08 '19 at 10:17
I found that the following was missing from the answer: 1) install it with `pip install bs4`; 2) read a file with `with open(xml_file) as fp: y = BeautifulSoup(fp)` – kotchwane Apr 27 '20 at 13:08
2

another 3 years and now BS4 is not fast enough. Takes ages. Looking for any faster solutions – Elvin Aghammadzada Jan 29 '21 at 05:38
@andilabs this seems to be the xml parser: `features="xml"`. I now get `Couldn't find a tree builder with the features you requested: xml. Do you need to install a parser library?` – Timo Jun 23 '21 at 19:50

score 109 · Answer 4 · edited Jan 31 '18 at 05:44

There are many options out there. cElementTree looks excellent if speed and memory usage are an issue. It has very little overhead compared to simply reading in the file using readlines.

The relevant metrics can be found in the table below, copied from the cElementTree website:

library                         time    space
xml.dom.minidom (Python 2.1)    6.3 s   80000K
gnosis.objectify                2.0 s   22000k
xml.dom.minidom (Python 2.4)    1.4 s   53000k
ElementTree 1.2                 1.6 s   14500k  
ElementTree 1.2.4/1.3           1.1 s   14500k  
cDomlette (C extension)         0.540 s 20500k
PyRXPU (C extension)            0.175 s 10850k
libxml2 (C extension)           0.098 s 16000k
readlines (read as utf-8)       0.093 s 8850k
cElementTree (C extension)  --> 0.047 s 4900K <--
readlines (read as ascii)       0.032 s 5050k

As pointed out by @jfs, cElementTree comes bundled with Python:

Python 2: from xml.etree import cElementTree as ElementTree.
Python 3: from xml.etree import ElementTree (the accelerated C version is used automatically).

edited Jan 31 '18 at 05:44 by Stevoisiak · answered Oct 10 '13 at 17:44 by Cyrus

10

Are there any downsides to using cElementTree? It seems to be a no-brainer. – mayhewsw Nov 11 '14 at 21:08
6

Apparently they don't want to use the library on OS X as I have spend over 15 minutes trying to figure out where to download it from and no link works. Lack of documentation prevents good projects from thriving, wish more people would realize that. – Stunner Dec 23 '14 at 06:55
8

@Stunner: it is in stdlib i.e., you don't need to download anything. On Python 2: `from xml.etree import cElementTree as ElementTree`. On Python 3: `from xml.etree import ElementTree` (the accelerated C version is used automatically) – jfs Oct 26 '15 at 14:16
1

@mayhewsw It's more effort to figure out how to efficiently use `ElementTree` for a particular task. For documents that fit in memory, it's a lot easier to use `minidom`, and it works fine for smaller XML documents. – Asclepius Oct 08 '16 at 08:51

score 53 · Answer 5 · edited Dec 12 '19 at 23:19

I suggest xmltodict for simplicity.

It parses your XML to an OrderedDict:

>>> e = """<foo>
             <bar>
                 <type foobar="1"/>
                 <type foobar="2"/>
             </bar>
        </foo>"""

>>> import xmltodict
>>> result = xmltodict.parse(e)
>>> result

OrderedDict([(u'foo', OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))]))])

>>> result['foo']

OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))])

>>> result['foo']['bar']

OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])])
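
To reach the attribute values themselves, index down to the type entry. The sketch below uses a hand-written plain dict that mirrors the structure printed above, so it runs without xmltodict installed. The normalisation step reflects xmltodict's default behaviour of returning a single dict rather than a list when an element has only one child with a given name:

```python
# Hand-written copy of the parsed structure shown above (an assumption made
# for illustration; in real use this would come from xmltodict.parse(e))
result = {"foo": {"bar": {"type": [{"@foobar": "1"}, {"@foobar": "2"}]}}}

types = result["foo"]["bar"]["type"]

# With a single <type> child the value would be one dict, not a list,
# so normalise before iterating
if not isinstance(types, list):
    types = [types]

# Attributes are stored under keys prefixed with "@"
values = [t["@foobar"] for t in types]
print(values)
```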

edited Dec 12 '19 at 23:19 by the Tin Man · answered Jun 12 '14 at 11:57 by myildirim

3

Agreed. If you don't need XPath or anything complicated, this is much simpler to use (especially in the interpreter); handy for REST APIs that publish XML instead of JSON – Dan Passaro Jul 25 '14 at 18:25
9

Remember that OrderedDict does not support duplicate keys. Most XML is chock-full of multiple siblings of the same types (say, all the paragraphs in a section, or all the types in your bar). So this will only work for very limited special cases. – TextGeek Jul 17 '18 at 15:47
3

@TextGeek In this case, `result["foo"]["bar"]["type"]` is a list of all `<type>` elements, so it is still working (even though the structure is maybe a bit unexpected). – luator Aug 30 '18 at 08:16
No updates since 2019 – kolypto Nov 10 '21 at 12:06
I just realized that no updates since 2019. We need to find an active fork. – myildirim Nov 11 '21 at 08:51

score 40 · Answer 6 · edited Jun 07 '13 at 08:15

lxml.objectify is really simple.

Taking your sample text:

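from collections import defaultdict

from lxml import objectify

# `text` is the sample XML from the question
text = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

count = defaultdict(int)

root = objectify.fromstring(text)

for item in root.bar.type:
    count[item.attrib.get("foobar")] += 1

print(dict(count))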

Output:

{'1': 1, '2': 1}
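
The same tally can be produced with collections.Counter. The sketch below swaps lxml.objectify for the standard-library ElementTree so that it runs without extra packages; that substitution is for illustration only and is not the approach used in this answer:

```python
import xml.etree.ElementTree as ET
from collections import Counter

text = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

root = ET.fromstring(text)

# Counter tallies how often each attribute value occurs across all <type> elements
counts = Counter(item.get("foobar") for item in root.iter("type"))
print(dict(counts))
```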

edited Jun 07 '13 at 08:15 by sandy · answered Dec 16 '09 at 10:42 by Ryan Ginstrom

`count` stores the counts of each item in a dictionary with default keys, so you don't have to check for membership. You can also try looking at `collections.Counter`. – Ryan Ginstrom Jul 20 '14 at 21:22

score 21 · Answer 7 · edited Dec 12 '19 at 23:20

Python has an interface to the expat XML parser.

xml.parsers.expat

It's a non-validating parser: it reports XML that is not well-formed, but it does not check a document against a DTD or schema, so invalid content will not be caught. If you know your file is correct, then this is pretty good: you'll probably get the exact info you want and you can discard the rest on the fly.

stringofxml = """<foo>
    <bar>
        <type arg="value" />
        <type arg="value" />
        <type arg="value" />
    </bar>
    <bar>
        <type arg="value" />
    </bar>
</foo>"""
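from xml.parsers import expat

count = 0

def start(name, attr):
    # Called for every opening tag; attr is a dict of that tag's attributes
    global count
    if name == 'type':
        count += 1

p = expat.ParserCreate()
p.StartElementHandler = start
p.Parse(stringofxml, True)  # True marks this as the final chunk of input

print(count)  # prints 4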
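
The handler also receives the attributes of each tag, so the same approach can collect attribute values instead of counting elements. A minimal sketch using the sample from the question; the handler and variable names are illustrative:

```python
from xml.parsers import expat

xml_text = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

values = []

def start(name, attrs):
    # attrs maps attribute names to values for the tag being opened
    if name == "type" and "foobar" in attrs:
        values.append(attrs["foobar"])

parser = expat.ParserCreate()
parser.StartElementHandler = start
parser.Parse(xml_text, True)  # True: this is the last (and only) chunk

print(values)
```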

score 19 · Answer 8 · edited Dec 12 '19 at 23:26

Just to add another possibility, you can use untangle, as it is a simple xml-to-python-object library. Here you have an example:

Installation:

pip install untangle

Usage:

Your XML file (a little bit changed):

<foo>
   <bar name="bar_name">
      <type foobar="1"/>
   </bar>
</foo>

Accessing the attributes with untangle:

import untangle

obj = untangle.parse('/path_to_xml_file/file.xml')

print(obj.foo.bar['name'])
print(obj.foo.bar.type['foobar'])

The output will be:

bar_name
1

More information about untangle can be found in "untangle".

Also, if you are curious, you can find a list of tools for working with XML and Python in "Python and XML". You will also see that the most common ones were mentioned by previous answers.

edited Dec 12 '19 at 23:26 by the Tin Man · answered Mar 17 '17 at 09:10 by jchanger

What makes untangle different from minidom? – Aaron Mann Jan 30 '20 at 00:11
I cannot tell you the difference between those two as I have not worked with minidom. – jchanger Jan 31 '20 at 08:02

score 16 · Answer 9 · answered Sep 04 '17 at 17:40

I might suggest declxml.

Full disclosure: I wrote this library because I was looking for a way to convert between XML and Python data structures without needing to write dozens of lines of imperative parsing/serialization code with ElementTree.

With declxml, you use processors to declaratively define the structure of your XML document and how to map between XML and Python data structures. Processors are used for both serialization and parsing, as well as for a basic level of validation.

Parsing into Python data structures is straightforward:

import declxml as xml

xml_string = """
<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>
"""

processor = xml.dictionary('foo', [
    xml.dictionary('bar', [
        xml.array(xml.integer('type', attribute='foobar'))
    ])
])

xml.parse_from_string(processor, xml_string)

Which produces the output:

{'bar': {'foobar': [1, 2]}}

You can also use the same processor to serialize data to XML

data = {'bar': {
    'foobar': [7, 3, 21, 16, 11]
}}

xml.serialize_to_string(processor, data, indent='    ')

Which produces the following output

<?xml version="1.0" ?>
<foo>
    <bar>
        <type foobar="7"/>
        <type foobar="3"/>
        <type foobar="21"/>
        <type foobar="16"/>
        <type foobar="11"/>
    </bar>
</foo>

If you want to work with objects instead of dictionaries, you can define processors to transform data to and from objects as well.

import declxml as xml

class Bar:

    def __init__(self):
        self.foobars = []

    def __repr__(self):
        return 'Bar(foobars={})'.format(self.foobars)


xml_string = """
<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>
"""

processor = xml.dictionary('foo', [
    xml.user_object('bar', Bar, [
        xml.array(xml.integer('type', attribute='foobar'), alias='foobars')
    ])
])

xml.parse_from_string(processor, xml_string)

Which produces the following output

{'bar': Bar(foobars=[1, 2])}

score 11 · Answer 10 · edited Dec 12 '19 at 23:28

Here is some very simple but effective code using ElementTree (cElementTree on Python 2).

import sys
# cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically
import xml.etree.ElementTree as ET

def exit_err(message):
    # Print the error message and stop the script
    sys.exit(message)

def find_in_tree(tree, node):
    found = tree.find(node)
    if found is None:
        print("No %s in file" % node)
        found = []
    return found

# Parse an XML file (specify the path)
def_file = "xml_file_name.xml"
try:
    root = ET.parse(def_file).getroot()
except (OSError, ET.ParseError):
    exit_err("Unable to open and parse input definition file: " + def_file)

# Find the first 'myNode' element; iterating over it yields its child nodes
fwdefs = find_in_tree(root, "myNode")

This is from "python xml parse".

score 11 · Answer 11 · edited Dec 12 '19 at 23:58

XML:

<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>

Python code:

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

tree = ET.parse("foo.xml")
root = tree.getroot()
print(root.tag)

for elem in root.findall("./bar/type"):
    # elem.attrib is a dict mapping attribute names to values
    for value in elem.attrib.values():
        print(value)

Output:

foo
1
2

edited Dec 12 '19 at 23:58 by the Tin Man · answered Jul 09 '18 at 07:35 by Ahito

G M · Answer 12 · 2020-06-03T16:08:15.153

xml.etree.ElementTree vs. lxml

Here are some advantages of the two most-used libraries that I wish I had known before choosing between them.

xml.etree.ElementTree:

Part of the standard library: there is no need to install any module.

lxml

XML declaration: it is easy to write one, for instance if you need to add standalone="no".
Pretty printing: you can get nicely indented XML without extra code.
Objectify functionality: it lets you use XML as if you were dealing with a normal Python object hierarchy (.node).
sourceline: lets you easily get the line number of the XML element you are using.
XSD validation: a built-in XSD schema checker is also available.
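
For what it's worth, newer standard-library versions narrow the pretty-printing gap. The sketch below assumes Python 3.9 or later, where ET.indent is available; it is a stdlib alternative and not a description of how lxml does it:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<foo><bar><type foobar="1"/><type foobar="2"/></bar></foo>'
)

# ET.indent (Python 3.9+) inserts whitespace in place so the tree serialises indented
ET.indent(root, space="  ")

# encoding="unicode" makes tostring return a str instead of bytes
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

Features such as sourceline and XSD validation still have no standard-library counterpart, so they remain reasons to pick lxml.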

score 11 · Answer 13 · edited Oct 29 '20 at 17:05

There's no need to use a lib specific API if you use python-benedict. Just initialize a new instance from your XML and manage it easily since it is a dict subclass.

Installation is easy: pip install python-benedict

from benedict import benedict as bdict

# data-source can be an url, a filepath or data-string (as in this example)
data_source = """
<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

data = bdict.from_xml(data_source)
t_list = data['foo.bar.type'] # yes, keypath supported
for t in t_list:
   print(t['@foobar'])

It supports and normalizes I/O operations with many formats: Base64, CSV, JSON, TOML, XML, YAML and query-string.

It is well tested and open-source on GitHub. Disclosure: I am the author.

score 9 · Answer 14 · edited Dec 12 '19 at 23:57

import xml.etree.ElementTree as ET
data = '''<foo>
           <bar>
               <type foobar="1"/>
               <type foobar="2"/>
          </bar>
       </foo>'''
tree = ET.fromstring(data)
lst = tree.findall('bar/type')
for item in lst:
    print(item.get('foobar'))

This will print the value of the foobar attribute.

edited Dec 12 '19 at 23:57 by the Tin Man · answered Feb 20 '17 at 15:56 by Souvik Dey

score 4 · Answer 15 · edited Jan 13 '22 at 07:12

simplified_scrapy: a new lib, I fell in love with it after I used it. I recommend it to you.

from simplified_scrapy import SimplifiedDoc
xml = '''
<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>
'''

doc = SimplifiedDoc(xml)
types = doc.selects('bar>type')
print (len(types)) # 2
print (types.foobar) # ['1', '2']
print (doc.selects('bar>type>foobar()')) # ['1', '2']

Here are more examples. This lib is easy to use.

score 2 · Answer 16 · answered Feb 03 '23 at 05:42

I am surprised that no one has suggested pandas. Pandas has a function, read_xml(), which is perfect for such flat XML structures.

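from io import StringIO

import pandas as pd

xml = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

# Recent pandas versions expect a file-like object rather than a literal XML string
df = pd.read_xml(StringIO(xml), xpath=".//type")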
print(df)

Output:

   foobar
0       1
1       2

Siraj · Answer 17 · 2020-02-20T12:56:52.877

# If the XML is a byte string, as shown below:
from lxml import etree

# Sample XML that uses the namespace http://xmlns.abc.com
message = b'<?xml version="1.0" encoding="UTF-8"?>\r\n<pa:Process xmlns:pa="http://xmlns.abc.com">\r\n\t<pa:firsttag>SAMPLE</pa:firsttag></pa:Process>\r\n'

# lxml parses bytes directly, XML declaration included, so no decoding or
# string replacement is needed. remove_blank_text drops whitespace-only
# text nodes; it does not remove the namespace.
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(message, parser)

result = {}
for child in root:  # iterate over the direct children of the root element
    print(child.tag, child.text)
    # Namespaced tags are exposed as {namespace-uri}localname
    if child.tag == "{http://xmlns.abc.com}firsttag":
        result["FIRST_TAG"] = child.text
print(result)

### output
'''{http://xmlns.abc.com}firsttag SAMPLE
{'FIRST_TAG': 'SAMPLE'}'''

Please also include some context explaining how your answer solves the issue. Code-only answers aren't encouraged. — Pedram Parsian, Feb 20 '20 at 03:57

Liju · Answer 18 · 2020-08-26T08:18:05.317

If you don't want to use any external libraries or third-party tools, please try the code below.

It parses XML into a Python dictionary.
It parses XML attributes as well.
It also parses empty tags like <tag/> and tags with only attributes like <tag var=val/>.

Code

import re

def getdict(content):
    res=re.findall(r"<(?P<var>\S*)(?P<attr>[^/>]*)(?:(?:>(?P<val>.*?)</(?P=var)>)|(?:/>))",content)
    if len(res)>=1:
        attreg=r"(?P<avr>\S+?)(?:(?:=(?P<quote>['\"])(?P<avl>.*?)(?P=quote))|(?:=(?P<avl1>.*?)(?:\s|$))|(?P<avl2>[\s]+)|$)"
        if len(res)>1:
            return [{i[0]:[{"@attributes":[{j[0]:(j[2] or j[3] or j[4])} for j in re.findall(attreg,i[1].strip())]},{"$values":getdict(i[2])}]} for i in res]
        else:
            # a single match: res[0] is the (tag, attributes, value) tuple
            return {res[0][0]:[{"@attributes":[{j[0]:(j[2] or j[3] or j[4])} for j in re.findall(attreg,res[0][1].strip())]},{"$values":getdict(res[0][2])}]}
    else:
        return content

with open("test.xml","r") as f:
    print(getdict(f.read().replace('\n','')))

Sample input

<details class="4b" count=1 boy>
    <name type="firstname">John</name>
    <age>13</age>
    <hobby>Coin collection</hobby>
    <hobby>Stamp collection</hobby>
    <address>
        <country>USA</country>
        <state>CA</state>
    </address>
</details>
<details empty="True"/>
<details/>
<details class="4a" count=2 girl>
    <name type="firstname">Samantha</name>
    <age>13</age>
    <hobby>Fishing</hobby>
    <hobby>Chess</hobby>
    <address current="no">
        <country>Australia</country>
        <state>NSW</state>
    </address>
</details>

Output (Beautified)

[
  {
    "details": [
      {
        "@attributes": [
          {
            "class": "4b"
          },
          {
            "count": "1"
          },
          {
            "boy": ""
          }
        ]
      },
      {
        "$values": [
          {
            "name": [
              {
                "@attributes": [
                  {
                    "type": "firstname"
                  }
                ]
              },
              {
                "$values": "John"
              }
            ]
          },
          {
            "age": [
              {
                "@attributes": []
              },
              {
                "$values": "13"
              }
            ]
          },
          {
            "hobby": [
              {
                "@attributes": []
              },
              {
                "$values": "Coin collection"
              }
            ]
          },
          {
            "hobby": [
              {
                "@attributes": []
              },
              {
                "$values": "Stamp collection"
              }
            ]
          },
          {
            "address": [
              {
                "@attributes": []
              },
              {
                "$values": [
                  {
                    "country": [
                      {
                        "@attributes": []
                      },
                      {
                        "$values": "USA"
                      }
                    ]
                  },
                  {
                    "state": [
                      {
                        "@attributes": []
                      },
                      {
                        "$values": "CA"
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "details": [
      {
        "@attributes": [
          {
            "empty": "True"
          }
        ]
      },
      {
        "$values": ""
      }
    ]
  },
  {
    "details": [
      {
        "@attributes": []
      },
      {
        "$values": ""
      }
    ]
  },
  {
    "details": [
      {
        "@attributes": [
          {
            "class": "4a"
          },
          {
            "count": "2"
          },
          {
            "girl": ""
          }
        ]
      },
      {
        "$values": [
          {
            "name": [
              {
                "@attributes": [
                  {
                    "type": "firstname"
                  }
                ]
              },
              {
                "$values": "Samantha"
              }
            ]
          },
          {
            "age": [
              {
                "@attributes": []
              },
              {
                "$values": "13"
              }
            ]
          },
          {
            "hobby": [
              {
                "@attributes": []
              },
              {
                "$values": "Fishing"
              }
            ]
          },
          {
            "hobby": [
              {
                "@attributes": []
              },
              {
                "$values": "Chess"
              }
            ]
          },
          {
            "address": [
              {
                "@attributes": [
                  {
                    "current": "no"
                  }
                ]
              },
              {
                "$values": [
                  {
                    "country": [
                      {
                        "@attributes": []
                      },
                      {
                        "$values": "Australia"
                      }
                    ]
                  },
                  {
                    "state": [
                      {
                        "@attributes": []
                      },
                      {
                        "$values": "NSW"
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
]
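
For comparison, a similar nested structure can be built with the standard-library parser instead of regular expressions. This is only a sketch: the dictionary layout (tag, attributes, children, text) is an arbitrary choice for illustration, and the sample is a cut-down version of the input above with quoted attribute values and a single root element, because a real XML parser requires well-formed input:

```python
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Convert an Element into a plain nested dict (illustrative layout)."""
    node = {"tag": elem.tag, "attributes": dict(elem.attrib)}
    children = [element_to_dict(child) for child in elem]
    if children:
        node["children"] = children
    else:
        # Leaf element: keep its text, or an empty string for <tag/>
        node["text"] = (elem.text or "").strip()
    return node

xml_text = """<details class="4b" count="1">
    <name type="firstname">John</name>
    <age>13</age>
</details>"""

tree = element_to_dict(ET.fromstring(xml_text))
print(tree["attributes"])
print(tree["children"][0])
```

Attributes end up in one dict per element rather than a list of single-key dicts, which is usually easier to index.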

It's a good method, but the result it returns is not convenient to use. — yazz, Oct 23 '20 at 09:34

Siraj · Answer 19 · 2020-02-20T12:46:58.280

If the source is an xml file, say like this sample

<pa:Process xmlns:pa="http://sssss">
        <pa:firsttag>SAMPLE</pa:firsttag>
    </pa:Process>

you may try the following code

from lxml import etree

metadata = 'C:\\Users\\PROCS.xml'  # the sample XML file whose contents are shown above
# remove_blank_text drops whitespace-only text nodes; it does not touch namespaces
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(metadata, parser)  # parse PROCS.xml
root = tree.getroot()  # the root element, Process

# Strip the namespace from every tag: '{http://sssss}firsttag' becomes 'firsttag'
for elem in root.iter():
    if not hasattr(elem.tag, 'find'):
        continue  # comments and processing instructions have non-string tags
    i = elem.tag.find('}')
    if i >= 0:
        elem.tag = elem.tag[i+1:]

result = {}
for elem in tree.iter():
    # store the text of the matching tag in the dictionary
    if elem.tag == "firsttag":
        result["FIRST_TAG"] = str(elem.text)
        print(result)

Output would be

{'FIRST_TAG': 'SAMPLE'}
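
Rewriting tags is not the only way to deal with namespaces. The sketch below uses the standard-library ElementTree rather than lxml (a substitution made so it runs without extra packages) and looks the element up through a prefix map; the {*} wildcard form assumes Python 3.8 or later:

```python
import xml.etree.ElementTree as ET

xml_text = """<pa:Process xmlns:pa="http://sssss">
    <pa:firsttag>SAMPLE</pa:firsttag>
</pa:Process>"""

root = ET.fromstring(xml_text)

# The prefix is local to this dict; only the URI has to match the document
ns = {"pa": "http://sssss"}
first = root.find("pa:firsttag", ns)
print(first.text)

# Python 3.8+ also accepts {*} to match the tag in any namespace
same = root.find("{*}firsttag")
print(same is first)
```

This leaves the parsed tree unmodified, which matters if the document is going to be written back out.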

Hermann12 · Answer 20 · 2023-05-21T10:39:05.173

0

With iterparse() you can catch the tag attribute dictionary value:

import xml.etree.ElementTree as ET
from io import StringIO

xml = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>
"""

file = StringIO(xml)

for event, elem in ET.iterparse(file, ("end",)):
    if event == "end" and elem.tag == "type":
        print(elem.attrib["foobar"])
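
For large files, iterparse is usually combined with clearing each element once it has been handled. A minimal sketch of that pattern, assuming the same sample document; the values list is illustrative:

```python
import xml.etree.ElementTree as ET
from io import StringIO

xml_text = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>
"""

values = []
for event, elem in ET.iterparse(StringIO(xml_text), events=("end",)):
    if elem.tag == "type":
        values.append(elem.get("foobar"))
        # Drop the element's attributes, text and children once processed;
        # on large files this keeps memory use down
        elem.clear()

print(values)
```

The attribute has to be read before clear() is called, because clear() also empties the attribute dictionary.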

edited May 21 '23 at 10:39 · answered May 21 '23 at 10:18 by Hermann12
