While parsing an XML request from another program, I ended up with a rather complicated structure: a dict of dicts nested several levels deep, where some of the dicts also contain lists of dicts. Because of the way the XML is laid out, the dict is littered with the "garbage" keys "BEGIN_" and "Value" at every depth.

For example:

<depart>
    <BEGIN_>
        <id Value=""/>
        <code Value=""/>
        <name Value=""/>
        <declNameList/>
    </BEGIN_>
</depart>

is transformed into

{'depart': {'BEGIN_': {'id': {'Value': ''},
                       'code': {'Value': ''},
                       'name': {'Value': ''},
                       'declNameList': None}}}

and I need:

{'depart': {'id': '',
             'code': '',
             'name': '',
             'declNameList': None}}

Could you please help me remove this clutter using full-depth recursion? So far I have managed to transform h = {'status': {'BEGIN_': {'statusCode': {'Value': '0'}}}} into {'status': {'statusCode': {'Value': '0'}}} by using:

if 'Value' in h['status'].keys():
    h['status'] = h['status']['Value']
if 'BEGIN_' in h['status'].keys():
    h['status'] = h['status']['BEGIN_']

But I need to apply this kind of filter to the whole dictionary.

Rumotameru
  • What specifically do you want to remove? What is your expected output? Or you just want to [flatten the dictionary](https://stackoverflow.com/questions/6027558/flatten-nested-dictionaries-compressing-keys)? – Niel Godfrey Pablo Ponciano Sep 30 '21 at 14:17
  • @NielGodfreyPonciano, I've updated my question with an example – Rumotameru Sep 30 '21 at 14:34
  • How do you do the transformation from XML to the dictionary in your first example? Surely, all you need to do is modify that code to (effectively) skip the BEGIN_ element –  Sep 30 '21 at 14:40
  • Thank you for posting that link. Now you know what code you need to modify. Note also that that code is flawed when presented with XML data that contains identically named elements at the same level –  Sep 30 '21 at 14:48

1 Answer

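As noted in the comments, it would be preferable to solve this while parsing the XML, if that is possible. Otherwise, we can use a non-recursive solution that keeps a work list of the nested elements still to be visited. (The list is named queue, but list.pop() takes from the end, so it behaves as a stack; the traversal order does not affect the result.) For each dict visited, the contents of BEGIN_ are hoisted into that dict and every child of the form {'Value': x} is replaced with x: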

xml_dict = {
    'depart': {
        'BEGIN_': {
            'id1': {'Value': '11'},
            'code1': {'Value': '11'},
            'name1': {'Value': '11'},
            'declNameList1': None
        }
    },
    'BEGIN_': {
        "1": [
            {
                'id2': {'Value': '22'},
                'code2': {'Value': '22'},
                'name2': {'Value': '22'},
                'declNameList2': None
            },
            {
                'id3': {'Value': '33'},
                'code3': {'Value': '33'},
                'name3': {'Value': '33'},
                'declNameList3': None
            },
        ],
        "2": [
            {
                'id4': {'Value': '44'},
                'code4': {'Value': '44'},
                'name4': {'Value': '44'},
                'declNameList4': {
                    'code5': {'Value': '55'}
                },
            },
            {
                'id6': {'Value': '66'},
                'code6': {'Value': '66'},
                'name6': {'Value': '66'},
                'declNameList6': {
                    'code7': {
                        'BEGIN_': {
                            'name8': {'Value': '8'}
                        }
                    }
                },
            },
            {
                'any1': {'Value': '1'}
            },
            [
                {
                    "BEGIN_": {
                        'any2': {'Value': '2'}
                    },
                },
                {
                    "BEGIN_": {
                        'any3': {'Value': '3'}
                    },
                }
            ]
        ]
    }
}


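# Work list of containers still to be visited. Requires Python 3.8+ for ":=".
queue = [xml_dict]

while queue:
    # list.pop() takes the most recently added item, so this is a depth-first walk.
    data = queue.pop()

    if isinstance(data, dict):
        # Hoist the children of a "BEGIN_" wrapper into the current dict.
        if begin_value := data.pop("BEGIN_", None):
            data.update(begin_value)

        # Only existing keys are reassigned here, so the dict does not change
        # size during iteration.
        for key, value in data.items():
            if isinstance(value, dict) and value.keys() == {"Value"}:
                # Collapse {"Value": x} to x.
                data[key] = value["Value"]
            elif isinstance(value, (dict, list)):
                queue.append(value)

    elif isinstance(data, list):
        # Lists are kept as they are; only nested dicts and lists are scheduled.
        for item in data:
            if isinstance(item, (dict, list)):
                queue.append(item)

print(xml_dict)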

Output (reformatted for readability)

{
    "depart": {
        "id1": "11",
        "code1": "11",
        "name1": "11",
        "declNameList1": None
    },
    "1": [
        {
            "id2": "22",
            "code2": "22",
            "name2": "22",
            "declNameList2": None
        },
        {
            "id3": "33",
            "code3": "33",
            "name3": "33",
            "declNameList3": None
        }
    ],
    "2": [
        {
            "id4": "44",
            "code4": "44",
            "name4": "44",
            "declNameList4": {
                "code5": "55"
            }
        },
        {
            "id6": "66",
            "code6": "66",
            "name6": "66",
            "declNameList6": {
                "code7": {
                    "name8": "8"
                }
            }
        },
        {
            "any1": "1"
        },
        [
            {
                "any2": "2"
            },
            {
                "any3": "3"
            }
        ]
    ]
}
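
Since the question asks for full-depth recursion, here is a recursive sketch of the same clean-up for comparison. It is not the code from the answer above: the function name strip_wrappers and its parameters wrapper and value_key are made up for illustration, and unlike the loop above it returns a new structure instead of modifying the input in place. It assumes the data contains only dicts, lists, and scalars, and that a "BEGIN_" entry normally holds a dict; anything else under "BEGIN_" is left where it is.

```python
def strip_wrappers(node, wrapper="BEGIN_", value_key="Value"):
    """Return a cleaned copy of node; the input is left unchanged.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    if isinstance(node, list):
        return [strip_wrappers(item, wrapper, value_key) for item in node]
    if not isinstance(node, dict):
        return node  # scalars and None pass through untouched

    # A dict whose only key is the value key collapses to that value.
    if node.keys() == {value_key}:
        return node[value_key]

    cleaned = {}
    for key, child in node.items():
        child = strip_wrappers(child, wrapper, value_key)
        if key == wrapper and isinstance(child, dict):
            cleaned.update(child)  # hoist the wrapper's children one level up
        else:
            cleaned[key] = child   # non-dict wrapper contents are kept as-is
    return cleaned


sample = {'depart': {'BEGIN_': {'id': {'Value': ''},
                                'code': {'Value': ''},
                                'name': {'Value': ''},
                                'declNameList': None}}}

cleaned = strip_wrappers(sample)
print(cleaned)
# {'depart': {'id': '', 'code': '', 'name': '', 'declNameList': None}}
```

Python's default recursion limit (1000 frames) is far above the nesting depth of typical XML, so either approach should work here; the loop-based version is the safer choice only if the documents can be extremely deep.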