XML file to pandas DataFrame

Question

I have an issue with transforming XML to DataFrame. I have the following sample XML:

<Fruits>
    <Fruit ReferenceDate="2022-09-22"
                FruitName="Apple">
        <Identifier FruitIdentifier="111"
                    FruitBrand="GoldenApple"/>
        <FruitInformation Country="Turkey"
                          Colour="Green"/>
        <CompanyInformation CompanyName="GlobalFruits"
                            Location="USA"/>
        <Languages>
            <LanguageDependent CountryId="GB"
                               LanguageId="EN">
                <FreeText1>Sample sentence 1.</FreeText1>
                <FreeText2>Sample sentence 2.</FreeText2>
            </LanguageDependent>
        </Languages>
    </Fruit>
    <Fruit ReferenceDate="2022-09-22"
                FruitName="Orange">
        <Identifier FruitIdentifier="222"
                    FruitBrand="BestOrange"/>
        <FruitInformation Country="Egypt"
                          Colour="Orange"/>
        <CompanyInformation CompanyName="FreshFood"
                            Location="UK"/>
        <Languages>
            <LanguageDependent CountryId="GB"
                               LanguageId="EN">
                <FreeText1>Sample sentence 3.</FreeText1>
                <FreeText2>Sample sentence 4.</FreeText2>
            </LanguageDependent>
        </Languages>
    </Fruit>
</Fruits>

I want to transform it to the DataFrame. The final table should look like the table on the image below:

I'm sorry in advance if it's a duplicate question, but I did not find the solution which fits me.

So far I have the following code:

import pandas as pd 
import xml.etree.ElementTree as et 

xtree = et.parse("fruits.xml")
xroot = xtree.getroot() 

df_cols = ["ReferenceDate", "FruitName", "FruitIdentifier",
           "FruitBrand", "Country", "Colour", "CompanyName",
           "Location", "CountryId", "LanguageId"]
rows = []

for node in xroot.iter(): 
    ReferenceDate = node.attrib.get("ReferenceDate")
    FruitName = node.attrib.get("FruitName")
    FruitIdentifier = node.attrib.get("FruitIdentifier")
    FruitBrand = node.attrib.get("FruitBrand")
    Country = node.attrib.get("Country")
    Colour = node.attrib.get("Colour")
    CompanyName = node.attrib.get("CompanyName")
    Location = node.attrib.get("Location")
    CountryId = node.attrib.get("CountryId")
    LanguageId = node.attrib.get("LanguageId")
    
    rows.append({"ReferenceDate": ReferenceDate, "FruitName": FruitName, 
                  "FruitIdentifier": FruitIdentifier, "FruitBrand": FruitBrand,
                  "Country": Country, "Colour": Colour, "CompanyName": CompanyName, "Location": Location,
                  "CountryId": CountryId, "LanguageId": LanguageId})

out_df = pd.DataFrame(rows, columns = df_cols)

I have two main issues:

Cannot get texts(FreeText1 and FreeText2);
Each group of attributes within subquery has its own row.

Easily done with a powershell using xml linq : https://stackoverflow.com/questions/73988322/parse-a-large-xml-by-certain-count-of-records-by-records-using-the-sax-parser-in/73989927#73989927 — jdweng, Oct 08 '22 at 09:48
Hi @mzjn, thanks for noticing. I specified my issues in the description. — Pavel Andreev, Oct 08 '22 at 12:51

score 2 · Accepted Answer · edited Oct 08 '22 at 20:24

The below works

import pandas as pd
import xml.etree.ElementTree as ET

xml = '''<Fruits>
    <Fruit ReferenceDate="2022-09-22"
                FruitName="Apple">
        <Identifier FruitIdentifier="111"
                    FruitBrand="GoldenApple"/>
        <FruitInformation Country="Turkey"
                          Colour="Green"/>
        <CompanyInformation CompanyName="GlobalFruits"
                            Location="USA"/>
        <Languages>
            <LanguageDependent CountryId="GB"
                               LanguageId="EN">
                <FreeText1>Sample sentence 1.</FreeText1>
                <FreeText2>Sample sentence 2.</FreeText2>
            </LanguageDependent>
        </Languages>
    </Fruit>
    <Fruit ReferenceDate="2022-09-22"
                FruitName="Orange">
        <Identifier FruitIdentifier="222"
                    FruitBrand="BestOrange"/>
        <FruitInformation Country="Egypt"
                          Colour="Orange"/>
        <CompanyInformation CompanyName="FreshFood"
                            Location="UK"/>
        <Languages>
            <LanguageDependent CountryId="GB"
                               LanguageId="EN">
                <FreeText1>Sample sentence 3.</FreeText1>
                <FreeText2>Sample sentence 4.</FreeText2>
            </LanguageDependent>
        </Languages>
    </Fruit>
</Fruits>'''

root = ET.fromstring(xml)
rows = []

for f in root.findall('.//Fruit'):
    ReferenceDate = f.attrib.get("ReferenceDate")
    FruitName = f.attrib.get("FruitName")
    FruitIdentifier = f.find("Identifier").attrib.get("FruitIdentifier")
    FruitBrand = f.find("Identifier").attrib.get("FruitBrand")
    Country = f.find("FruitInformation").attrib.get("Country")
    Colour = f.find("FruitInformation").attrib.get("Colour")
    CompanyName = f.find("CompanyInformation").attrib.get("CompanyName")
    Location = f.find("CompanyInformation").attrib.get("Location")
    CountryId = f.find(".//LanguageDependent").attrib.get("CountryId")
    LanguageId = f.find(".//LanguageDependent").attrib.get("LanguageId")
    FreeText1 = f.find(".//LanguageDependent/FreeText1").text
    FreeText2 = f.find(".//LanguageDependent/FreeText2").text

    rows.append({"ReferenceDate": ReferenceDate, "FruitName": FruitName,
                 "FruitIdentifier": FruitIdentifier, "FruitBrand": FruitBrand,
                 "Country": Country, "Colour": Colour, "CompanyName": CompanyName, "Location": Location,
                 "CountryId": CountryId, "LanguageId": LanguageId, "FreeText1": FreeText1,
                 "FreeText2": FreeText2})

df = pd.DataFrame(rows)
print(df)

output

ReferenceDate FruitName  ...           FreeText1           FreeText2
0    2022-09-22     Apple  ...  Sample sentence 1.  Sample sentence 2.
1    2022-09-22    Orange  ...  Sample sentence 3.  Sample sentence 4.

[2 rows x 12 columns]

score 1 · Answer 2 · edited Dec 24 '22 at 12:50

1

A bit shorter and more universal implementation:

import xml.etree.ElementTree as ET
import pandas as pd

root = ET.parse("fruits.xml")

rows = []
for fruit in root:  # <Fruit> is child node of root node <Fruits>
    row = {}
    for node in fruit.iter():
        row.update(node.attrib)
        if node.text and not node.text.isspace():
            row[node.tag] = node.text
    rows.append(row)
            
df = pd.DataFrame(rows)

This implementation uses next logic:

We add attributes of every child node to row dictionary;
If current node has some non-whitespace text we write this text to row using tag name as key.

edited Dec 24 '22 at 12:50

Martijn Pieters

1,048,767
296
4,058
3,343

answered Oct 08 '22 at 17:17

Olvin Roght

7,677
2
16
35

1

Did you test your code? – jdweng Oct 09 '22 at 08:46
@jdweng, ah, sorry, I've missed `not` while copying today's edit. Corrected, thanks. – Olvin Roght Oct 09 '22 at 12:35

score 1 · Answer 3 · answered Oct 08 '22 at 20:51

While your XML is not shallow enough for a single pandas.read_xml, the needed data is consistent enough for multiple calls that can be horizontally merged:

fruits_df = (
    pd.concat([
        pd.read_xml("fruits.xml", xpath=".//Fruit", attrs_only = True, parser="etree"),
        pd.read_xml("fruits.xml", xpath=".//Identifier", attrs_only = True, parser="etree"),
        pd.read_xml("fruits.xml", xpath=".//FruitInformation", attrs_only = True, parser="etree"),
        pd.read_xml("fruits.xml", xpath=".//CompanyInformation", attrs_only = True, parser="etree"),
        pd.read_xml("fruits.xml", xpath=".//LanguageDependent", parser="etree")
       ], 
       axis=1
    )
)

Alternatively, via list comprehension:

fruits_df = (
    pd.concat(
        [
            pd.read_xml("fruits.xml", xpath=f".//{elem}", parser="etree")
            for elem in
            ["Fruit", "Identifier", "FruitInformation", "CompanyInformation", "LanguageDependent"]
       ], 
       axis=1
    ).dropna(axis="columns")
)

Output

  ReferenceDate FruitName  FruitIdentifier   FruitBrand Country  Colour   CompanyName Location CountryId LanguageId           FreeText1           FreeText2
0    2022-09-22     Apple              111  GoldenApple  Turkey   Green  GlobalFruits      USA        GB         EN  Sample sentence 1.  Sample sentence 2.
1    2022-09-22    Orange              222   BestOrange   Egypt  Orange     FreshFood       UK        GB         EN  Sample sentence 3.  Sample sentence 4.

score 0 · Answer 4 · answered Oct 08 '22 at 11:43

Try following powershell script :

using assembly System 
using assembly System.Xml.Linq 
using assembly System.Data

$dt = New-Object System.Data.DataTable("Fruits")
$dt.Columns.Add("ReferenceDate", [DateTime]) | Out-Null
$dt.Columns.Add("FruitName", [string]) | Out-Null
$dt.Columns.Add("FruitIdentifier", [string]) | Out-Null
$dt.Columns.Add("FruitBrand", [string]) | Out-Null
$dt.Columns.Add("Country", [string]) | Out-Null
$dt.Columns.Add("Colour", [string]) | Out-Null
$dt.Columns.Add("CompanyName", [string]) | Out-Null
$dt.Columns.Add("Location", [string]) | Out-Null
$dt.Columns.Add("CountryId", [string]) | Out-Null
$dt.Columns.Add("LanguageId", [string]) | Out-Null
$dt.Columns.Add("FreeText1", [string]) | Out-Null
$dt.Columns.Add("FreeText2", [string]) | Out-Null

$Filename = "c:\temp\test.xml"
$xDoc = [System.Xml.Linq.XDocument]::Load($Filename)
$fruits = $xDoc.Descendants("Fruit")

foreach($fruit in $fruits)
{
   $row = $dt.NewRow()
   $dt.Rows.Add($row)

   $date = [DateTime]$fruit.Attribute("ReferenceDate")
   $row.ReferenceDate = $date

   $name = $fruit.Attribute("FruitName").Value
   $row.FruitName = $name

   $identifier = $fruit.Element("Identifier")
   $fruitIdentifier = $identifier.Attribute("FruitIdentifier").Value
   $row.FruitIdentifier = $fruitIdentifier
   $fruitBrand = $identifier.Attribute("FruitBrand").Value
   $row.FruitBrand = $fruitBrand
   
   $fruitInformation = $fruit.Element("FruitInformation")
   $country = $fruitInformation.Attribute("Country").Value
   $row.country = $country
   $colour = $fruitInformation.Attribute("Colour").Value
   $row.Colour = $colour
   
   $companyInformation = $fruit.Element("CompanyInformation")
   $companyName = $companyInformation.Attribute("CompanyName").Value
   $row.CompanyName = $companyName
   $location = $companyInformation.Attribute("Location").Value
   $row.Location = $location

   $languageDependent = $fruit.Descendants("LanguageDependent")
   $countryId = $languageDependent.Attribute("CountryId").Value
   $row.CountryId = $countryId
   $languageId = $languageDependent.Attribute("LanguageId").Value
   $row.languageId = $languageId
   $freeText1 = $languageDependent.Element("FreeText1").Value
   $row.FreeText1 = $freeText1
   $freeText2 = $languageDependent.Element("FreeText2").Value
   $row.FreeText2 = $freeText2


 
}
# convert the datatable to an array of strings
$table = $dt | Format-Table | Out-String -Stream

foreach ($row in $table) {
   Write-Host $row
}

Hi @jdweng, thank you! But is it possible to use powershell script within python IDE? — Pavel Andreev, Oct 08 '22 at 12:55
See following : https://stackoverflow.com/questions/21944895/running-powershell-script-within-python-script-how-to-make-python-print-the-pow — jdweng, Oct 08 '22 at 14:07

XML file to pandas DataFrame

4 Answers4