
I have many XML files that I want to convert to JSON and then load into OpenRefine or a pandas DataFrame for analysis. The XML files look like this:

                <NATURE_QUANTITY_SCOPE>
                    <TOTAL_QUANTITY_OR_SCOPE>
                        <P>Entreprisens omfang:</P>
                        <P>Arbeidet omfatter bl.a følgende:</P>
                        <P>•    Ramming av stålrør</P>
                        <P>•    Løsmassearbeider, graving over og under vann, erosjonssikring</P>
                        <P>•    Forskalings-, armerings, og betongarbeider i stålrørspeler, kaipir og bru</P>
                        <P>•    Elektroarbeider </P>
                        <P>Arbeidet består bl.a av levering og montering av:</P>
                        <P>•    aggregathus</P>
                        <P>•    pullere og T-pullere</P>
                        <P>•    lodd til redningsleider</P>
                        <P>•    dumperdekk</P>
                        <P>•    aggregat og sylindere</P>
                        <P>•    sperrebom</P>
                        <P>Videre består arbeidet bl.a av mottak og montering av: </P>
                        <P>•    brulager inkl. fester til landkar, fendring     </P>
                        <P>•    heisetårn </P>
                        <P>•    sikringsbjelke</P>
                        <P>•    horisontale stålrør</P>
                        <P>•    komplette fenderpanel med innstøpingsgods/kjetting/gummifendere etc.</P>
                        <P>•    innstøpingsgods for dumperdekk</P>
                        <P>•    innstøpingsgods for overgangsplate</P>
                        <P>•    innstøpingsgods for horisontale stålrør</P>
                        <P>•    alle bolter for innstøpingsgods/vemohylser/skruer etc.</P>
                        <P>•    redningsleider</P>
                        <P>•    rekkverk og port kai</P>
                        <P>•    fotlist kai</P>
                    </TOTAL_QUANTITY_OR_SCOPE>
                </NATURE_QUANTITY_SCOPE>

I have tried this code:

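import json
import os

import xmltodict

# Folder containing the XML files to convert
path = r"C:\Users\ujorbjo00\Documents\xmltodict test"

for filename in os.listdir(path):
    # Skip anything that is not an XML file
    if not filename.endswith('.xml'):
        continue

    fullname = os.path.join(path, filename)

    # Read the whole XML document as text
    with open(fullname, 'r', encoding='utf_8') as f:
        xmlString = f.read()

    # Parse the XML into a dict, then serialize that dict as JSON
    jsonString = json.dumps(xmltodict.parse(xmlString, encoding='utf-8', process_namespaces=True, xml_attribs=True))

    # Write the JSON next to the source file, swapping .xml for .json
    with open(fullname[:-4] + ".json", 'w', encoding='utf_8') as f:
        f.write(jsonString)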

but the JSON file looks like this:

"NATURE_QUANTITY_SCOPE": {
                        "TOTAL_QUANTITY_OR_SCOPE": {
                            "P": ["Entreprisens omfang:", "Arbeidet omfatter bl.a f\u00f8lgende:", "\u2022\tRamming av st\u00e5lr\u00f8r", "\u2022\tL\u00f8smassearbeider, graving over og under vann, erosjonssikring", "\u2022\tForskalings-, armerings, og betongarbeider i st\u00e5lr\u00f8rspeler, kaipir og bru", "\u2022\tElektroarbeider", "Arbeidet best\u00e5r bl.a av levering og montering av:", "\u2022\taggregathus", "\u2022\tpullere og T-pullere", "\u2022\tlodd til redningsleider", "\u2022\tdumperdekk", "\u2022\taggregat og sylindere", "\u2022\tsperrebom", "Videre best\u00e5r arbeidet bl.a av mottak og montering av:", "\u2022\tbrulager inkl. fester til landkar, fendring", "\u2022\theiset\u00e5rn", "\u2022\tsikringsbjelke", "\u2022\thorisontale st\u00e5lr\u00f8r", "\u2022\tkomplette fenderpanel med innst\u00f8pingsgods/kjetting/gummifendere etc.", "\u2022\tinnst\u00f8pingsgods for dumperdekk", "\u2022\tinnst\u00f8pingsgods for overgangsplate", "\u2022\tinnst\u00f8pingsgods for horisontale st\u00e5lr\u00f8r", "\u2022\talle bolter for innst\u00f8pingsgods/vemohylser/skruer etc.", "\u2022\tredningsleider", "\u2022\trekkverk og port kai", "\u2022\tfotlist kai"]
                        }

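That gives 26 rows, and the Norwegian characters show up as \u escape sequences instead of being encoded properly! In OpenRefine the columns look like this: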

_ - DOFFIN_ESENDERS - FORM_SECTION - CONTRACT - FD_CONTRACT - OBJECT_CONTRACT_INFORMATION - QUANTITY_SCOPE - NATURE_QUANTITY_SCOPE - TOTAL_QUANTITY_OR_SCOPE

_ - DOFFIN_ESENDERS - FORM_SECTION - CONTRACT - FD_CONTRACT - OBJECT_CONTRACT_INFORMATION - QUANTITY_SCOPE - NATURE_QUANTITY_SCOPE - TOTAL_QUANTITY_OR_SCOPE - P

_ - DOFFIN_ESENDERS - FORM_SECTION - CONTRACT - FD_CONTRACT - OBJECT_CONTRACT_INFORMATION - QUANTITY_SCOPE - NATURE_QUANTITY_SCOPE - TOTAL_QUANTITY_OR_SCOPE - P - P

I would like to have all of the <P> elements in one row.

1 Answer


I'm not sure I understand everything in your question, but for the Norwegian character issue you can have a look at this post.
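For what it's worth, the \u00f8-style sequences in your output are what the standard library's json.dumps produces by default, because its ensure_ascii parameter defaults to True. A minimal sketch of the difference, using a made-up record rather than your real data:

```python
import json

# Hypothetical sample shaped like a small part of the parsed output
record = {"P": ["Arbeidet omfatter bl.a følgende:", "•\tRamming av stålrør"]}

# Default behaviour: every non-ASCII character is written as a \uXXXX escape
escaped = json.dumps(record)

# With ensure_ascii=False the characters are written as-is
readable = json.dumps(record, ensure_ascii=False)

# Both strings are valid JSON and decode back to exactly the same data
assert json.loads(escaped) == json.loads(readable) == record
```

So the escaped file is not corrupted: any JSON reader will turn the escapes back into ø and å. If you want the file itself to be human-readable, pass ensure_ascii=False and keep writing it with a UTF-8 encoding, as you already do.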

As for the XML being converted into a JSON list: that is to be expected. During the conversion, XML tags become JSON dictionary keys, and keys in a dictionary, unlike tags in XML, must be unique, so everything under the same tag is collected under the same key as a list. It also works the other way around: think of an HTML list, where every list element is surrounded by the same <li> tag.
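To illustrate the idea, here is a small sketch using only the standard library. It is not xmltodict's actual implementation, just the same grouping principle applied to a made-up three-element document:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, shortened stand-in for the real document
xml_text = "<SCOPE><P>first</P><P>second</P><P>third</P></SCOPE>"
root = ET.fromstring(xml_text)

# A dict cannot hold three separate "P" keys, so the texts of all
# children sharing a tag are appended to one list under that tag
grouped = defaultdict(list)
for child in root:
    grouped[child.tag].append(child.text)

result = {root.tag: dict(grouped)}
print(result)
```

Three sibling <P> elements end up as a single "P" key holding a three-item list, which is the same shape as the 26-item list in your output.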

If that's not the behavior you want, please specify what the desired behavior would have been.
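If the goal is simply to get all the <P> texts into one cell, one option is to join the list into a single string before dumping to JSON. This is only a sketch under that assumption; join_paragraphs is a hypothetical helper, not part of xmltodict, and the sample dict is a shortened stand-in for your parsed data:

```python
def join_paragraphs(value, sep="\n"):
    """Collapse the value stored under a repeated tag into one string."""
    if value is None:            # empty element
        return ""
    if isinstance(value, str):   # a single <P> is stored as a plain string
        return value
    # Several <P> elements are stored as a list; skip empty ones
    return sep.join(str(item) for item in value if item is not None)


# Hypothetical sample shaped like the parsed output in the question
scope = {
    "TOTAL_QUANTITY_OR_SCOPE": {
        "P": ["Entreprisens omfang:", "•\tRamming av stålrør", "•\tElektroarbeider"]
    }
}

inner = scope["TOTAL_QUANTITY_OR_SCOPE"]
inner["P"] = join_paragraphs(inner["P"])
```

After this, "P" holds one string with the paragraphs separated by newlines, so a tool that creates one row per list item should see a single value. The isinstance check matters because a file with only one <P> yields a string rather than a one-item list.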

Fantailed