-1

I have parent-child data in a CSV, and I need to convert the hierarchy to an XML file with Python. I've searched everywhere but most solutions involve JSON or JavaScript, and aren't directly relevant to my issue. If you know about an answer I missed, please link me!

Example CSV data:

species, species_code, group, group_code
mammals, 00, animals, 0,
bears, 01, mammals, 00
horses, 02, mammals,00,
cats, 03, mammals, 00,
polar bear, 011, bears, 01,
grizzly bear, 012, bears, 01,
palomino, 021, horses, 02,
bengal tiger, 031, cats, 03,
cougar, 032, cats, 03,
lynx, 033, cats, 03
canada_lynx, 0331, lynx, 033

This data should be transformed into an XML, with parent and child groups arranged like this:

  <group>
            animals
            <group_code> 0 </group_code>
            <species>
                mammals
                <species_code > 00 </species_code>
            </species >
        </group>
        <group>
            mammals
            <group_code > 00 </group_code>
            <species>
                bears
                <species_code> 01 </species_code>
            </species>
            <species>
                horses
                <species_code> 02 </species_code>
            </species>
            <species>
                cats
                <species_code> 03 </species_code>
            </species>
        </group>
        <group>
            bears
            <group_code > 01 </group_code>
            <species>
                polar_bear
                <species_code> 011 </species_code>
            </species>
            <species>
                grizzly bear
                <species_code> 012 </species_code>
            </species>  

The pattern should remain consistent for parent-child relationships within the dataset. Thank you so much for any help you can provide. I'm still pretty new at this and appreciate any new knowledge you can provide.

crimins
  • 1
  • 2
  • It's going to take some code on your part I fear on this. I'm not sure a direct transfer of data into this format exists. What have you tried? – roganjosh Oct 26 '17 at 22:14
  • @roganjosh: I tried to make a hierarchical index with Pandas, then write the strings to the necessary XML positions. I couldn't get it to work, but I'm not scared of a lot of code if it provides the right solution. I just don't know where to start. – crimins Oct 26 '17 at 22:18
  • 1
    I'm scared of a lot of code :) You should provide an ["MCVE"](https://stackoverflow.com/help/mcve) and show your own attempts to solve this issue. – roganjosh Oct 26 '17 at 22:21
  • One thing you should consider is data structures, not file formats. Consider, abstractly, what the data structure is of data that happens to be stored as CSV. You could treat it either as a list of rows or a list of columns. Next think about your hierarchical data structure (dictionaries come to mind) and how to transform from one to the other using simple built-in types in Python. The actual serialization formats (CSV, XML, JSON, etc. are largely irrelevant to your question (though some formats are better suited than others for some types of data structures). – Iguananaut Oct 26 '17 at 22:35
  • @roganjosh: Thanks, I will provide an MCVE when I get back into the office. Don't have the work stored on this machine. – crimins Oct 26 '17 at 22:45
  • @Iguananaut: Thanks for your comment. How would I convert the data to a dictionary? How to convert the dictionary to the output format? – crimins Oct 26 '17 at 22:45
  • That's for you to think about. – Iguananaut Oct 27 '17 at 11:56

1 Answers1

2

This provides the output you want with pandas, using the fact that you can iterate over the results of pandas.groupby(), which returns something like a list of (group, df[df['grouped_column'] == group]) tuples.

I also sort the results of groupby by group_code as string so it's lexicographically sorted, thus achieving the "hierarchy" you need according to your sample output (I'm not sure if this is needed, maybe you could simply sort the column as integers).

import pandas as pd
import io
import csv
from xml.etree.ElementTree import Element, SubElement, tostring, ElementTree

sample_csv = io.StringIO("""
species, species_code, group, group_code
mammals, 00, animals, 0
bears, 01, mammals, 00
horses, 02, mammals,00
cats, 03, mammals, 00
polar bear, 011, bears, 01
grizzly bear, 012, bears, 01
palomino, 021, horses, 02
bengal tiger, 031, cats, 03
cougar, 032, cats, 03
lynx, 033, cats, 03
canada_lynx, 0331, lynx, 033""")

df = pd.read_csv(sample_csv, dtype=str)
df.columns = [col.strip() for col in df.columns]
for col in df.columns:
    df[col] = df[col].str.strip()

groups = Element('groups')
xml_tree = ElementTree(groups)

for group_group_code, df in sorted(df.groupby(['group', 'group_code']), key=lambda grp: grp[0][1]):
    group, group_code = group_group_code
    xml_group = Element('group')
    xml_group.text = group.strip()
    xml_group_code = SubElement(xml_group, 'group_code')
    xml_group_code.text = group_code.strip()
    for row in df[['species', 'species_code']].itertuples():
        xml_species = SubElement(xml_group, 'species')
        xml_species.text = row.species
        xml_species_code = SubElement(xml_species, 'species_code')
        xml_species_code.text = row.species_code
    groups.append(xml_group)

print(b''.join(tostring(gr) for gr in groups))
xml_tree.write('groups_test.xml')

The print(...) line will print exactly what you asked in your question (at least structurally, without indentation and newlines), while the xml_tree.... line will create an xml file with all your groups, but wrapped in a <groups> root element. It still writes the file without any indentation/newlines though.

If your xml output should include indentation/newlines based on this solution :

from xml.dom import minidom

print(xml.dom.minidom.parseString(tostring(groups)).toprettyxml(indent="    "))

Which as far as I can tell, is exactly what you asked (wrapped in a <groups> root tag).

Unatiel
  • 1,060
  • 1
  • 11
  • 17
  • Thanks @Unatiel! This is exactly what I needed. One question: How do I resolve the output? When I print this solution, I get : '[, , , , , ]' – crimins Nov 06 '17 at 16:17