
I am trying to extract elements from an XML document using Python's xml.etree.ElementTree library and to generate a JSON output from those elements.

The idea is to pass it a list of XPath expressions that select the elements I want. I don't want to iterate over every element in the XML, as there are a lot of them.

The XML looks something like this:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<Line xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Data>
        <Date>2020-01-02</Date>
        <Id>id_1</Id>
        <CodDevice>567</CodDevice>
        <DataList>
            <Item>
                <Row>1</Row>
                <Value>34.67</Value>
                <Description>WHEELS</Description>
                <Tag>tag1</Tag>
            </Item>
            <Item>
                <Row>2</Row>
                <Value>38.04</Value>
                <Description>MOTOR</Description>
                <Tag>tag1</Tag>
            </Item>
        </DataList>
        <MetaList>
            <Metadata>
                <Row>1</Row>
                <Value>some value</Value>
            </Metadata>
        </MetaList>
    </Data>
</Line> 

The approach I am considering is as follows:

import xml.etree.ElementTree as ET
import json

data = """<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<Line xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Data>
        <Date>2020-01-02</Date>
        <Id>id_1</Id>
        <CodDevice>567</CodDevice>
        <DataList>
            <Item>
                <Row>1</Row>
                <Value>34.67</Value>
                <Description>WHEELS</Description>
                <Tag>tag1</Tag>
            </Item>
            <Item>
                <Row>2</Row>
                <Value>38.04</Value>
                <Description>MOTOR</Description>
                <Tag>tag1</Tag>
            </Item>
        </DataList>
        <MetaList>
            <Metadata>
                <Row>1</Row>
                <Value>some value</Value>
            </Metadata>
        </MetaList>
    </Data>
</Line>     
"""

tag_list = [
'./Data/Date',
'./Data/Id',
'./Data/CodDevice',
'./Data/DataList/Item/Row',
'./Data/DataList/Item/Value',
'./Data/DataList/Item/Description',
'./Data/MetaList/Metadata/Row',
'./Data/MetaList/Metadata/Value'
]

elem_dict= {}
  
parser = ET.XMLParser(encoding="utf-8")
root = ET.fromstring(data, parser=parser)

for tag in tag_list:
    for item in root.findall(tag):
        elem_dict[item.tag] = item.text
print(json.dumps(elem_dict))

As you can see, I build a dictionary keyed by tag name, so the elements that repeat inside the lists (Row, Value, Description) overwrite each other, which produces the following output:

{"Date": "2020-01-02", "Id": "id_1", "CodDevice": "567", "Row": "1", "Value": "some value", "Description": "MOTOR"}

But what I would like to get is something similar to:

{"Id":"id_1","CodDevice":"567","DataList":[{"Row":1,"Value":34.67,"Description":"WHEELS"}, {"Row":2,"Value":38.04,"Description":"MOTOR"}],"MetaList":[{"Row":1,"Value":"some value"}]}

I don't know the library's capabilities in detail, so maybe there is a more efficient way to achieve this that I am overlooking.

Any ideas on how to approach this would be great. Thank you very much!

basigow

2 Answers


Your task involves:

  • filtering of the source XML tree,
  • changing the names and structure of elements (e.g. turning Item elements into entries of a list),
  • generating a "multi-level" (nested) output.

This is why I think that the most natural approach is to write some custom code.

Start with a function that gets the text of an XML element (it will be used later):

def getTxt(elem):
    # elem.text is None for an empty element such as <Tag/>, so fall back to ''
    return (elem.text or '').strip()

Then define another function to add children to a dictionary:

def addChildren(dct, elem, childNames, fn=getTxt):
    for it in elem:
        tag = it.tag
        if tag in childNames:
            dct[tag] = fn(it)

Parameters:

  • dct - the dictionary to add content to.
  • elem - the source element.
  • childNames - names of the children of elem to look for and process.
  • fn - a function generating the content for each element.
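The fn parameter can be illustrated with a small sketch. The toNumber helper below is hypothetical (it is not part of the answer's code) and assumes you want numeric-looking text converted to int or float, as in the expected output from the question:

```python
import xml.etree.ElementTree as ET

def getTxt(elem):
    return elem.text.strip()

def toNumber(elem):
    # Hypothetical helper: try int, then float, otherwise keep the text
    txt = getTxt(elem)
    for cast in (int, float):
        try:
            return cast(txt)
        except ValueError:
            pass
    return txt

def addChildren(dct, elem, childNames, fn=getTxt):
    for it in elem:
        if it.tag in childNames:
            dct[it.tag] = fn(it)

# A single Item element taken from the sample XML
item = ET.fromstring(
    '<Item><Row>1</Row><Value>34.67</Value>'
    '<Description>WHEELS</Description><Tag>tag1</Tag></Item>'
)

as_text = {}
addChildren(as_text, item, ['Row', 'Value', 'Description'])

as_numbers = {}
addChildren(as_numbers, item, ['Row', 'Value', 'Description'], toNumber)

print(as_text)
print(as_numbers)
```

With the default fn every value stays a string; passing toNumber changes only how each child's content is generated, not which children are selected.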

To get the content for both lists, define yet another function:

def getItems(elem):
    lst = []
    for it in elem:
        dct = {}
        addChildren(dct, it, ['Row', 'Value', 'Description'])
        lst.append(dct)
    return lst

And the last step is the main code, assuming that you have your XML tree in root:

dct = {}
nd = root.find('Data')
addChildren(dct, nd, ['Date', 'Id', 'CodDevice'])
addChildren(dct, nd, ['DataList', 'MetaList'], getItems)

Now dct contains (after some reformatting):

{
  'Date': '2020-01-02',
  'Id': 'id_1',
  'CodDevice': '567',
  'DataList': [
    {'Row': '1', 'Value': '34.67', 'Description': 'WHEELS'},
    {'Row': '2', 'Value': '38.04', 'Description': 'MOTOR'}
  ],
  'MetaList': [
    {'Row': '1', 'Value': 'some value'}
  ]
}

If you want to save it as JSON, use json.dump (to write to a file) or json.dumps (to get a string).

I'm not sure whether the output should contain the Date key (your tag_list contains it, but the expected output doesn't). If it is not needed, remove 'Date' from the first childNames.
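Putting the pieces together, here is a minimal end-to-end sketch that leaves out Date and serializes the result with json.dumps. The xml_text and json_text variable names and the compacted XML layout are assumptions made for the example:

```python
import json
import xml.etree.ElementTree as ET

# Sample XML from the question, compacted for brevity
xml_text = """<Line>
    <Data>
        <Date>2020-01-02</Date>
        <Id>id_1</Id>
        <CodDevice>567</CodDevice>
        <DataList>
            <Item><Row>1</Row><Value>34.67</Value><Description>WHEELS</Description><Tag>tag1</Tag></Item>
            <Item><Row>2</Row><Value>38.04</Value><Description>MOTOR</Description><Tag>tag1</Tag></Item>
        </DataList>
        <MetaList>
            <Metadata><Row>1</Row><Value>some value</Value></Metadata>
        </MetaList>
    </Data>
</Line>"""

def getTxt(elem):
    return elem.text.strip()

def addChildren(dct, elem, childNames, fn=getTxt):
    for it in elem:
        if it.tag in childNames:
            dct[it.tag] = fn(it)

def getItems(elem):
    lst = []
    for it in elem:
        dct = {}
        addChildren(dct, it, ['Row', 'Value', 'Description'])
        lst.append(dct)
    return lst

root = ET.fromstring(xml_text)
nd = root.find('Data')

dct = {}
addChildren(dct, nd, ['Id', 'CodDevice'])                  # 'Date' left out on purpose
addChildren(dct, nd, ['DataList', 'MetaList'], getItems)   # nested lists

json_text = json.dumps(dct, indent=2)
print(json_text)
```

To write to a file instead, json.dump(dct, f) takes an open file object f as its second argument.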

Valdi_Bo

Consider merging dictionaries built with dict comprehensions:

data = root.find('.//Data')
elem_dict = {
    # Top-level children with non-empty text (Date, Id, CodDevice)
    **{d.tag: d.text.strip() for d in data.findall('*') if d.text.strip() != ""},
    # One dict per Item, skipping the Tag child
    **{'DataList': [{i.tag: i.text.strip() for i in item.findall('*') if i.tag != 'Tag'}
                    for item in data.findall('.//DataList/Item')]},
    # One dict per Metadata element
    **{'MetaList': [{m.tag: m.text.strip() for m in meta.findall('*')}
                    for meta in data.findall('.//MetaList/Metadata')]}
}

print(json.dumps(elem_dict))
# {"Date": "2020-01-02", "Id": "id_1", "CodDevice": "567",
#  "DataList": [{"Row": "1", "Value": "34.67", "Description": "WHEELS"},
#               {"Row": "2", "Value": "38.04", "Description": "MOTOR"}],
#  "MetaList": [{"Row": "1", "Value": "some value"}]}
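
One caveat: the comprehensions call .text.strip() directly, and in ElementTree the text of an empty element such as <Tag/> is None, which would raise AttributeError. A possible guard, sketched with a hypothetical text_of helper and a made-up input containing empty elements (<Note/> and <Description/> are not in the original XML):

```python
import json
import xml.etree.ElementTree as ET

# Made-up input with empty elements, which have text == None
root = ET.fromstring(
    '<Line><Data><Id>id_1</Id><Note/>'
    '<DataList><Item><Row>1</Row><Description/><Tag>tag1</Tag></Item></DataList>'
    '</Data></Line>'
)

def text_of(elem):
    # Hypothetical helper: treat missing text as an empty string
    return (elem.text or "").strip()

data = root.find('.//Data')
elem_dict = {
    # Keep only top-level children that carry non-empty text
    **{d.tag: text_of(d) for d in data.findall('*') if text_of(d)},
    # One dict per Item, skipping the Tag child
    'DataList': [
        {i.tag: text_of(i) for i in item.findall('*') if i.tag != 'Tag'}
        for item in data.findall('.//DataList/Item')
    ],
}

print(json.dumps(elem_dict))
```

Here the empty Note element is filtered out at the top level, while the empty Description inside the Item is kept as an empty string.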
Parfait