1
<?xml version="2.0" encoding="UTF-8" ?><timestamp="20220113">
<defintions>
    <defintion id="1" old_id="0">Lang</defintion>
    <defintion id="7" old_id="1">Eng</defintion>

How can I parse an XML file that looks like this? Here, each tag carries multiple attribute values. I want to extract values such as "id" and "old_id" into a list or DataFrame.

HedgeHog
x89
  • Rolled the version back because the additional question is now asked under: https://stackoverflow.com/questions/75210241/parse-nested-xml-and-extract-attributes-tag-text-both Thanks – HedgeHog Jan 23 '23 at 13:36

3 Answers

2

You could use BeautifulSoup with the xml parser: select the elements you need, then iterate over the ResultSet and extract the attribute values via .get(). To read the XML from a file:

from bs4 import BeautifulSoup

# The 'xml' parser requires lxml to be installed
with open('filename.xml', 'r') as f:
    soup = BeautifulSoup(f.read(), 'xml')

Example

from bs4 import BeautifulSoup
import pandas as pd

xml = '''<?xml version="2.0" encoding="UTF-8" ?><timestamp="20220113">
<defintions>
    <defintion id="1" old_id="0">Lang</defintion>
    <defintion id="7" old_id="1">Eng</defintion>
'''
soup = BeautifulSoup(xml, 'xml')

pd.DataFrame(
    [
        (e.get('id'), e.get('old_id'))
        for e in soup.select('defintion')
    ],
    columns=['id', 'old_id']
)

Output

  id old_id
0  1      0
1  7      1
HedgeHog
  • Could you also help with a second use case? In this case, I need to extract a combination: attributes of one tag (i.e. **offer**, like we did earlier), contents of some tags themselves (e.g. for **level, name**), and then the attributes of the first tag (timestamp), whose value would repeat across all fields. I edited the question – x89 Jan 23 '23 at 12:01
  • To keep the original question clean, this would be better suited to [asking a new question](https://stackoverflow.com/questions/ask) with exactly this focus. Simply drop the link in the comments to reference your new question; that would be great – HedgeHog Jan 23 '23 at 13:00
  • https://stackoverflow.com/questions/75210241/parse-nested-xml-and-extract-attributes-tag-text-both – x89 Jan 23 '23 at 13:33
0

Using Python's Beautiful Soup, you could parse the .xml file into a BeautifulSoup object and then use .findAll('defintion'). Then loop through the tags you find and get the desired values:

# object is the parsed BeautifulSoup object
defintions = object.findAll('defintion')

for defintion in defintions:
    old_id = defintion['old_id']
    id = defintion['id']

references: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ https://linuxhint.com/parse_xml_python_beautifulsoup/
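
Putting the pieces together, here is a minimal self-contained sketch of this approach. It assumes the snippet from the question is wrapped in a closed defintions element (the sample string and the name rows are illustrative, not from the original post) and that lxml is installed for the 'xml' parser. It uses find_all(), the current spelling of findAll().

```python
from bs4 import BeautifulSoup

# Illustrative, well-formed sample: the question's snippet with a closing tag added
xml = '''<defintions>
    <defintion id="1" old_id="0">Lang</defintion>
    <defintion id="7" old_id="1">Eng</defintion>
</defintions>'''

# The 'xml' parser relies on lxml being installed
soup = BeautifulSoup(xml, 'xml')

# One dict per <defintion> element, holding its two attributes
rows = [
    {'id': tag['id'], 'old_id': tag['old_id']}
    for tag in soup.find_all('defintion')
]
print(rows)
```

The attribute values come back as strings, so convert them with int() if numeric IDs are needed.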

  • how do you define "object" if you are reading the content from a file? – x89 Jan 23 '23 at 11:00
  • In newer code, avoid the old `findAll()` syntax; use `find_all()` or `select()` with CSS selectors instead. For more, take a minute to [check docs](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names) – HedgeHog Jan 23 '23 at 11:02
  • with open('teachers.xml', 'r') as f: file = f.read() # 'xml' is the parser used. For html files, which BeautifulSoup is typically used for, it would be 'html.parser'. soup = BeautifulSoup(file, 'xml') ref : https://stackabuse.com/parsing-xml-with-beautifulsoup-in-python/ – Francisco Rodrigues Jan 23 '23 at 17:21
0

If you have valid XML such as the following (a tag name cannot be assigned a value the way an attribute is, so here timestamp is an attribute of a root element):

<?xml version='1.0' encoding='utf-8'?>
<root timestamp='20220113'>
<defintions>
    <defintion id="1" old_id="0">Lang</defintion>
    <defintion id="7" old_id="1">Eng</defintion>
</defintions>
</root>

Then you can use pandas:

import pandas as pd

df = pd.read_xml('x89.xml', xpath='.//defintion')
print(df.to_string(index=False))

Output:

 id  old_id defintion
  1       0      Lang
  7       1       Eng
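
If pandas is not available, the same extraction can be sketched with the standard library's xml.etree.ElementTree. This is an alternative technique, not what the pandas code above does; it assumes the corrected XML from this answer, inlined here as a string without the XML declaration, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The corrected, well-formed XML from this answer, inlined for illustration
xml = '''<root timestamp='20220113'>
<defintions>
    <defintion id="1" old_id="0">Lang</defintion>
    <defintion id="7" old_id="1">Eng</defintion>
</defintions>
</root>'''

root = ET.fromstring(xml)

# timestamp is now an ordinary attribute of the root element
timestamp = root.get('timestamp')

# Collect (id, old_id, text) for every <defintion> element at any depth
rows = [
    (e.get('id'), e.get('old_id'), e.text)
    for e in root.iter('defintion')
]
print(timestamp, rows)
```

Unlike pd.read_xml, which infers numeric column types, this keeps every value as a string.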
Hermann12