I have an issue with transforming XML to DataFrame. I have the following sample XML:
<Fruits>
<Fruit ReferenceDate="2022-09-22"
FruitName="Apple">
<Identifier FruitIdentifier="111"
FruitBrand="GoldenApple"/>
<FruitInformation Country="Turkey"
Colour="Green"/>
<CompanyInformation CompanyName="GlobalFruits"
Location="USA"/>
<Languages>
<LanguageDependent CountryId="GB"
LanguageId="EN">
<FreeText1>Sample sentence 1.</FreeText1>
<FreeText2>Sample sentence 2.</FreeText2>
</LanguageDependent>
</Languages>
</Fruit>
<Fruit ReferenceDate="2022-09-22"
FruitName="Orange">
<Identifier FruitIdentifier="222"
FruitBrand="BestOrange"/>
<FruitInformation Country="Egypt"
Colour="Orange"/>
<CompanyInformation CompanyName="FreshFood"
Location="UK"/>
<Languages>
<LanguageDependent CountryId="GB"
LanguageId="EN">
<FreeText1>Sample sentence 3.</FreeText1>
<FreeText2>Sample sentence 4.</FreeText2>
</LanguageDependent>
</Languages>
</Fruit>
</Fruits>
I want to transform it to the DataFrame. The final table should look like the table on the image below:
I'm sorry in advance if it's a duplicate question, but I did not find the solution which fits me.
So far I have the following code:
import pandas as pd
import xml.etree.ElementTree as et
xtree = et.parse("fruits.xml")
xroot = xtree.getroot()
df_cols = ["ReferenceDate", "FruitName", "FruitIdentifier",
"FruitBrand", "Country", "Colour", "CompanyName",
"Location", "CountryId", "LanguageId"]
rows = []
for node in xroot.iter():
ReferenceDate = node.attrib.get("ReferenceDate")
FruitName = node.attrib.get("FruitName")
FruitIdentifier = node.attrib.get("FruitIdentifier")
FruitBrand = node.attrib.get("FruitBrand")
Country = node.attrib.get("Country")
Colour = node.attrib.get("Colour")
CompanyName = node.attrib.get("CompanyName")
Location = node.attrib.get("Location")
CountryId = node.attrib.get("CountryId")
LanguageId = node.attrib.get("LanguageId")
rows.append({"ReferenceDate": ReferenceDate, "FruitName": FruitName,
"FruitIdentifier": FruitIdentifier, "FruitBrand": FruitBrand,
"Country": Country, "Colour": Colour, "CompanyName": CompanyName, "Location": Location,
"CountryId": CountryId, "LanguageId": LanguageId})
out_df = pd.DataFrame(rows, columns = df_cols)
I have two main issues:
- Cannot get texts(FreeText1 and FreeText2);
- Each group of attributes within subquery has its own row.