How to parse XML with multiple attribute values within a single tag to DataFrame?

Question

<?xml version="2.0" encoding="UTF-8" ?><timestamp="20220113">
<defintions>
    <defintion id="1" old_id="0">Lang</defintion>
    <defintion id="7" old_id="1">Eng</defintion>

How can I parse an XML file that looks like this? Each tag carries several attributes. I want to extract attribute values such as "id" and "old_id" into a list or DataFrame.

Rolled the question back to its earlier version because the additional question is now asked separately at: https://stackoverflow.com/questions/75210241/parse-nested-xml-and-extract-attributes-tag-text-both Thanks – HedgeHog, Jan 23 '23 at 13:36

HedgeHog · Accepted Answer · 2023-01-23T11:25:57.577

2

You could use BeautifulSoup with the xml parser: select the elements you need, then iterate over the ResultSet and read each attribute value via .get().

from bs4 import BeautifulSoup
# The 'xml' parser requires lxml to be installed
with open('filename.xml', 'r') as f:
    file = f.read()
soup = BeautifulSoup(file, 'xml')

Example

from bs4 import BeautifulSoup
import pandas as pd

xml = '''<?xml version="2.0" encoding="UTF-8" ?><timestamp="20220113">
<defintions>
    <defintion id="1" old_id="0">Lang</defintion>
    <defintion id="7" old_id="1">Eng</defintion>
'''
soup = BeautifulSoup(xml, 'xml')

# One (id, old_id) tuple per <defintion> element
pd.DataFrame(
    [
        (e.get('id'), e.get('old_id'))
        for e in soup.select('defintion')
    ],
    columns=['id', 'old_id']
)

Output

	id	old_id
0	1	0
1	7	1
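
If the element text is wanted alongside the attributes, the same approach extends with .get_text(). This is a minimal sketch, not part of the original answer: it assumes a well-formed variant of the sample (the `root` wrapper element and the `text` column name are assumptions) and that lxml is installed for the 'xml' parser.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Well-formed variant of the sample; the <root> wrapper is an assumption.
xml = """<root timestamp="20220113">
<defintions>
    <defintion id="1" old_id="0">Lang</defintion>
    <defintion id="7" old_id="1">Eng</defintion>
</defintions>
</root>"""

soup = BeautifulSoup(xml, "xml")

# One dict per <defintion>: both attributes plus the element's text.
rows = [
    {
        "id": e.get("id"),
        "old_id": e.get("old_id"),
        "text": e.get_text(strip=True),
    }
    for e in soup.find_all("defintion")
]

df = pd.DataFrame(rows)
print(df)
```

Attribute values come back as strings, so convert them (for example with df.astype) if numeric columns are needed.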

edited Jan 23 '23 at 11:25

answered Jan 23 '23 at 11:08

HedgeHog

Could you also help with a second use case? In this case, I need to extract a combination: attributes of one tag (i.e. **offer**, like we did earlier), contents of some tags themselves (e.g. for **level, name**), and then the attributes of the first tag (timestamp), whose value would repeat across all fields. I edited the question – x89 Jan 23 '23 at 12:01
To keep the original question clean, this would be best handled by [asking a new question](https://stackoverflow.com/questions/ask) with exactly this focus. Simply drop the link in the comments to reference your new question, that would be great – HedgeHog Jan 23 '23 at 13:00
https://stackoverflow.com/questions/75210241/parse-nested-xml-and-extract-attributes-tag-text-both – x89 Jan 23 '23 at 13:33

score 0 · Answer 2 · answered Jan 23 '23 at 10:53

Using Python's Beautiful Soup, you could parse the .xml file into a BeautifulSoup object and then call .findAll('defintion') on it. Then loop through the tags you find and read the desired attribute values:

# soup is the BeautifulSoup object parsed from the file contents
defintions = soup.findAll('defintion')

for defintion in defintions:
    old_id = defintion['old_id']
    id_ = defintion['id']
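
The loop above reads attributes with bracket access, while the accepted answer uses .get(). The two differ when an attribute is absent, which the following sketch illustrates. The second element without old_id is a hypothetical input made up for the demonstration, and the 'xml' parser assumes lxml is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical input: the second <defintion> has no old_id attribute.
xml = """<defintions>
    <defintion id="1" old_id="0">Lang</defintion>
    <defintion id="7">Eng</defintion>
</defintions>"""

soup = BeautifulSoup(xml, "xml")
second = soup.find_all("defintion")[1]

# .get() returns None for a missing attribute.
missing = second.get("old_id")

# Bracket access raises KeyError for a missing attribute.
try:
    second["old_id"]
    raised = False
except KeyError:
    raised = True

print(missing, raised)
```

When some tags may lack an attribute, .get() keeps the loop running and leaves a None in the result that can be handled later.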

references: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ https://linuxhint.com/parse_xml_python_beautifulsoup/

answered Jan 23 '23 at 10:53

Francisco Rodrigues

how do you define "object" if you are reading the content from a file? – x89 Jan 23 '23 at 11:00
In newer code, avoid the old `findAll()` syntax; use `find_all()` instead, or `select()` with `css selectors`. For more, take a minute to [check docs](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names) – HedgeHog Jan 23 '23 at 11:02
`with open('teachers.xml', 'r') as f: file = f.read()` and then `soup = BeautifulSoup(file, 'xml')`. Here 'xml' is the parser used; for HTML files, which BeautifulSoup is typically used for, it would be 'html.parser'. Ref: https://stackabuse.com/parsing-xml-with-beautifulsoup-in-python/ – Francisco Rodrigues Jan 23 '23 at 17:21

Hermann12 · Answer 3 · 2023-01-24T18:34:15.130

If you have valid XML like the following (a tag cannot be assigned a value directly the way an attribute is, so timestamp becomes an attribute of a root element):

<?xml version='1.0' encoding='utf-8'?>
<root timestamp='20220113'>
<defintions>
    <defintion id="1" old_id="0">Lang</defintion>
    <defintion id="7" old_id="1">Eng</defintion>
</defintions>
</root>

Then you can use pandas:

import pandas as pd

df = pd.read_xml('x89.xml', xpath='.//defintion')
print(df.to_string(index=False))

Output:

 id  old_id defintion
  1       0      Lang
  7       1       Eng
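
For comparison, the same rows can be built with only the standard library, which also makes it easy to repeat the root's timestamp on every row. This is a sketch using xml.etree.ElementTree rather than pandas, so it is not what the answer above does; the XML is inlined as a string here instead of being read from 'x89.xml', and the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

# Same valid XML as above, inlined as a string for a self-contained example.
xml_text = """<root timestamp='20220113'>
<defintions>
    <defintion id="1" old_id="0">Lang</defintion>
    <defintion id="7" old_id="1">Eng</defintion>
</defintions>
</root>"""

root = ET.fromstring(xml_text)
timestamp = root.get("timestamp")  # attribute of the root element

# One dict per <defintion>; the timestamp repeats on every row.
rows = [
    {
        "timestamp": timestamp,
        "id": el.get("id"),
        "old_id": el.get("old_id"),
        "defintion": el.text,
    }
    for el in root.iter("defintion")
]

for row in rows:
    print(row)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) if a DataFrame is needed. All values are strings at this point.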
