2

Sharing Sample XML file. Need to convert this fie to CSV, even if extra tags are added in this file. {without using tag names}. And XML file tag names should be used as column names while converting it to CSV}

Example Data:

<?xml version="1.0" encoding="UTF-8"?>

<Food>
    <Info>
        <Msg>Food Store items.</Msg>
    </Info>

    <store slNo="1">
        <foodItem>meat</foodItem>
        <price>200</price>
        <quantity>1kg</quantity>
        <discount>7%</discount>
    </store>

    <store slNo="2">
        <foodItem>fish</foodItem>
        <price>150</price>
        <quantity>1kg</quantity>
        <discount>5%</discount>
    </store>

    <store slNo="3">
        <foodItem>egg</foodItem>
        <price>100</price>
        <quantity>50 pieces</quantity>
        <discount>5%</discount>
    </store>

    <store slNo="4">
        <foodItem>milk</foodItem>
        <price>50</price>
        <quantity>1 litre</quantity>
        <discount>3%</discount>
    </store>

</Food>

Tried Below code but getting error with same.

import xml.etree.ElementTree as ET
import pandas as pd

ifilepath = r'C:\DATA_DIR\feeds\test\sample.xml'
ofilepath = r'C:\DATA_DIR\feeds\test\sample.csv'
root = ET.parse(ifilepath).getroot()

print(root)
with open(ofilepath, "w") as file:
    for child in root:
        print(child.tag, child.attrib)
        # naive example how you could save to csv line wise
        file.write(child.tag+";"+child.attrib)

Above code is able to find root node, but unable to concatenate its attributes though

Tried one more code, but this works for 1 level nested XML, who about getting 3-4 nested tags in same XML file. And currently able to print values of all tags and their text. need to convert these into relational model { CSV file}

import xml.etree.ElementTree as ET

tree = ET.parse(ifilepath)
root = tree.getroot()
for member in root.findall('*'):
    print(member.tag,member.attrib)
    for i in (member.findall('*')):
        print(i.tag,i.text)

Above example works well with pandas read_xml { using lxml parser}

But when we try to use the similar way out for below XML data, it doesn't produce indicator ID value and Country ID value as output in CSV file

Example Data ::

<?xml version="1.0" encoding="UTF-8"?>
<du:data xmlns:du="http://www.dummytest.org" page="1" pages="200" per_page="20" total="1400" sourceid="5" sourcename="Dummy ID Test" lastupdated="2022-01-01">
   <du:data>
      <du:indicator id="AA.BB">various, tests</du:indicator>
      <du:country id="MM">test again</du:country>
      <du:date>2021</du:date>
      <du:value>1234567</du:value>
      <du:unit />
      <du:obs_status />
      <du:decimal>0</du:decimal>
   </du:data>
   <du:data>
      <du:indicator id="XX.YY">testing, cases</du:indicator>
      <du:country id="DD">coverage test</du:country>
      <du:date>2020</du:date>
      <du:value>3456223</du:value>
      <du:unit />
      <du:obs_status />
      <du:decimal>0</du:decimal>
   </du:data>
</du:data>

Solution Tried ::

import pandas as pd
    
pd.read_xml(ifilepath, xpath='.//du:data', namespaces= {"du": "http://www.dummytest.org"}).to_csv(ofilepath, sep=',', index=None, header=True)

Output Got ::

indicator,country,date,value,unit,obs_status,decimal
"various, tests",test again,2021,1234567,,,0
"testing, cases",coverage test,2020,3456223,,,0

Expected output ::

indicator id,indicator,country id,country,date,value,unit,obs_status,decimal
AA.BB,"various, tests",MM,test again,2021,1234567,,,0
XX.YY,"testing, cases",DD,coverage test,2020,3456223,,,0

Adding Example data , having usage of 2 or more xpath's. Looking for ways to convert the same using pandas to_csv()

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl'?>
<CATALOG>
    <PLANT>
    <COMMON>rose</COMMON>
    <BOTANICAL>canadensis</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>Shady</LIGHT>
    <PRICE>202</PRICE>
    <AVAILABILITY>446</AVAILABILITY>
    </PLANT>
    <PLANT>
    <COMMON>mango</COMMON>
    <BOTANICAL>sunny</BOTANICAL>
    <ZONE>3</ZONE>
    <LIGHT>shady</LIGHT>
    <PRICE>301</PRICE>
    <AVAILABILITY>569</AVAILABILITY>
    </PLANT>
    <PLANT>
    <COMMON>Marigold</COMMON>
    <BOTANICAL>palustris</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>Sunny</LIGHT>
    <PRICE>500</PRICE>
    <AVAILABILITY>799</AVAILABILITY>
    </PLANT>
    <PLANT>
    <COMMON>carrot</COMMON>
    <BOTANICAL>Caltha</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>sunny</LIGHT>
    <PRICE>205</PRICE>
    <AVAILABILITY>679</AVAILABILITY>
    </PLANT>
    <FOOD>
    <NAME>daal fry</NAME>
    <PRICE>300</PRICE>
    <DESCRIPTION>
    Famous daal tadka from surat
    </DESCRIPTION>
    <CALORIES>60</CALORIES>
    </FOOD>
    <FOOD>
    <NAME>Dhosa</NAME>
    <PRICE>350</PRICE>
    <DESCRIPTION>
    The famous south indian dish
    </DESCRIPTION>
    <CALORIES>80</CALORIES>
    </FOOD>
    <FOOD>
    <NAME>Khichdi</NAME>
    <PRICE>150</PRICE>
    <DESCRIPTION>
    The famous gujrati dish
    </DESCRIPTION>
    <CALORIES>40</CALORIES>
    </FOOD>
    <BOOK>
      <AUTHOR>Santosh Bihari</AUTHOR>
      <TITLE>PHP Core</TITLE>
      <GENER>programming</GENER>
      <PRICE>44.95</PRICE>
      <DATE>2000-10-01</DATE>
   </BOOK>
   <BOOK>
      <AUTHOR>Shyam N Chawla</AUTHOR>
      <TITLE>.NET Begin</TITLE>
      <GENER>Computer</GENER>
      <PRICE>250</PRICE>
      <DATE>2002-17-05</DATE>
   </BOOK>
   <BOOK>
      <AUTHOR>Anci C</AUTHOR>
      <TITLE>Dr. Ruby</TITLE>
      <GENER>Computer</GENER>
      <PRICE>350</PRICE>
      <DATE>2001-04-11</DATE>
   </BOOK>
</CATALOG>
Parfait
  • 104,375
  • 17
  • 94
  • 125
Eja
  • 45
  • 1
  • 7
  • StackOverflow is not a free code-writing service. Please [research](https://meta.stackoverflow.com/q/261592/1422451) for solutions to this regular problem and make an earnest attempt at solution. Come back with a _specific_ issue regarding your implementation. – Parfait Oct 25 '22 at 13:50
  • We understand what StackOverflow is . Have tried many a ways , but looking for a generic way to convert nested XML to CSV format. – Eja Oct 27 '22 at 12:31
  • Error :: file.write(child.tag+";"+child.attrib) TypeError: can only concatenate str (not "dict") to str Info {} – Eja Oct 27 '22 at 12:32
  • 1
    Please [edit] your post with attempted code and not in long, hard-to-read comments. Once done, please delete your comments. – Parfait Oct 27 '22 at 20:28
  • Done with formatting of tried code – Eja Nov 01 '22 at 04:44
  • Please see this [answer](https://stackoverflow.com/a/69859621/1422451). Those attributes (`id="AA.BB"` and `id="MM"`) will not be picked up by `read_xml` since they are not immediate descendants of nodes in your `xpath`. – Parfait Nov 01 '22 at 18:39

2 Answers2

1

ElementTree is not really the best tool for what I believe you're trying to do. Since you have well-formed, relatively simple xml, try using pandas:

import pandas as pd

#from here, it's just a one liner
pd.read_xml('input.xml',xpath='.//store').to_csv('output.csv',sep=',', index = None, header=True)

and that should get you your csv file.

Jack Fleeting
  • 24,385
  • 6
  • 23
  • 45
  • thanks, this works well on simple XML files. But is there a way we can convert XML files to CSV generically, without knowing its xpath. And every subtag can be appended to its previous tag name as new column. – Eja Nov 01 '22 at 14:04
  • Tried this way on shared data in edited Description , output is not providing all the column values { specially not for "indicator id="AA.BB"" and country id="MM" – Eja Nov 01 '22 at 14:05
  • @Eja Of course it doesn't work. You newly edited sample xml is nothing like the pre-edit sample xml... – Jack Fleeting Nov 01 '22 at 14:12
1

Given parsing element values and their corresponding attributes involves a second layer of iteration, consider a nested list/dict comphrehension with dictionary merge. Also, use csv.DictWriter to build CSV via dictionaries:

from csv import DictWriter
import xml.etree.ElementTree as ET

ifilepath = "Input.xml"

tree = ET.parse(ifilepath)
nmsp = {"du": "http://www.dummytest.org"}

data = [
     {
       **{el.tag.split('}')[-1]: (el.text.strip() if el.text is not None else None) for el in d.findall("*")},
       **{f"{el.tag.split('}')[-1]} {k}":v for el in d.findall("*") for k,v in el.attrib.items()},
       **d.attrib
     }     
     for d in tree.findall(".//du:data", namespaces=nmsp)    
]

dkeys = list(data[0].keys())

with open("DummyXMLtoCSV.csv", "w", newline="") as f:
    dw = DictWriter(f, fieldnames=dkeys)
    dw.writeheader()
    
    dw.writerows(data)

Output

indicator,country,date,value,unit,obs_status,decimal,indicator id,country id
"various, tests",test again,2021,1234567,,,0,AA.BB,MM
"testing, cases",coverage test,2020,3456223,,,0,XX.YY,DD

While above will add attributes to last columns of CSV. For specific ordering, re-order the dictionaries:

data = [ ... ]

cols = ["indicator id", "indicator", "country id", "country", "date", "value", "unit", "obs_status", "decimal"]

data = [
    {k: d[k] for k in cols} for d in data
]

with open("DummyXMLtoCSV.csv", "w", newline="") as f:
    dw = DictWriter(f, fieldnames=cols)
    dw.writeheader()
    
    dw.writerows(data)

Output

indicator id,indicator,country id,country,date,value,unit,obs_status,decimal
AA.BB,"various, tests",MM,test again,2021,1234567,,,0
XX.YY,"testing, cases",DD,coverage test,2020,3456223,,,0
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • This works well Parfait. But if i try to run the same code for a simple XML file { the first example with }. This gives me error as "**{el.tag.split('}')[1]: (el.text.strip() if el.text is not None else None) for el in d.findall("*")}, IndexError: list index out of range" – Eja Nov 03 '22 at 05:56
  • Any generic way to handle both the XML files conversion to CSV. ? – Eja Nov 03 '22 at 05:57
  • See [edit](https://stackoverflow.com/posts/74281738/revisions), adjusting index to -1 and adding the top-level attributes. – Parfait Nov 03 '22 at 13:20
  • Yes, using -1 as index, and using top-level attributes works well on simple XML file also. But can we look for some generic code, that can cover both the examples together. ? – Eja Nov 04 '22 at 06:37
  • Also, instead of writing this data in to a file, can we store the same in a dataframe. That will be an ease to then convert that dataframe in to xls,csv or any other form – Eja Nov 04 '22 at 06:39
  • Generic code is difficult for different XML files since XML is an open-ended data format and can vary in nested dimensions and levels and not limited to rectangular types like CSV with two-dimensions of row by column. You need specific XPaths to handle these two different types (which is also differ in namespaces). – Parfait Nov 04 '22 at 16:40
  • As for storing in data frame for other formats, you may be asking a different question. This solution answers your titled question and tags (XML to CSV in Python). But really, you can easily pass `data` (list of dicts) into a `pandas.DataFrame` call to render a data frame. Also, heed my [comment](https://stackoverflow.com/questions/74194876/how-can-we-convert-a-nested-xml-to-csv-in-python-dynamically-nested-xml-may-con/74281738?noredirect=1#comment131140412_74194876) above about using `read_xml` (which can still work with XSLT as link shows). – Parfait Nov 04 '22 at 16:44
  • thanks for the help Parfait . I have covered both examples with condition clause of Namespace being None or not. addition to the same simple XML example and converting the same to CSV using "pd.read_xml(ifilepath, xpath=xpath).to_csv(ofilepath, sep=',', index=None, header=True)". is there a way we can pass on 2 or more numbers of xpath's ? Speciafically, when number of xpaths would be provided as an input { can be 2,3 or 8 xpaths } – Eja Nov 09 '22 at 08:12
  • eg Data :: added in main Description – Eja Nov 09 '22 at 08:18
  • You may have to close this out (select best answer) and ask a new question instead of updating your original which can confuse future readers when OP continues to update/edit needs. Again, there is no generic form to cover all XML types! You may need to run `read_xml` multiple times and merge/concat/etc. – Parfait Nov 10 '22 at 18:14