2

I am new to python and looking for a solution to the following:

I have a file.xml that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<HEADER>
    <PRODUCT_DETAILS>
        <DESCRIPTION_SHORT>green cat w short hair</DESCRIPTION_SHORT>
        <DESCRIPTION_LONG>green cat w short hair and unlimited zoomies</DESCRIPTION_LONG>
    </PRODUCT_DETAILS>
    <PRODUCT_FEATURES>
        <FEATURE>
            <FNAME>Colour</FNAME>
            <FVALUE>green</FVALUE>
        </FEATURE>
        <FEATURE>
            <FNAME>Legs</FNAME>
            <FVALUE>14</FVALUE>
        </FEATURE>
    </PRODUCT_FEATURES>
    <PRODUCT_DETAILS>
        <DESCRIPTION_SHORT>blue dog w no tail</DESCRIPTION_SHORT>
        <DESCRIPTION_LONG>blue dog w no tail and unlimited zoomies</DESCRIPTION_LONG>
    </PRODUCT_DETAILS>
    <PRODUCT_FEATURES>
        <FEATURE>
            <FNAME>Colour</FNAME>
            <FVALUE>blue</FVALUE>
        </FEATURE>
        <FEATURE>
            <FNAME>Happiness Levels</FNAME>
            <FVALUE>11/10</FVALUE>
        </FEATURE>
    </PRODUCT_FEATURES>
</HEADER>

This is my code:

from lxml import etree as et
import pandas as pd

xml_data = et.parse('file2.xml')
products = xml_data.xpath('//HEADER')

headers=[elem.tag for elem in xml_data.xpath('//HEADER[1]//PRODUCT_DETAILS//*')]
headers.extend(xml_data.xpath('//HEADER[1]//FNAME/text()'))

rows = []

for product in products:

    row = [product.xpath(f'.//{headers[0]}/text()')[0],product.xpath(f'.//{headers[1]}/text()')[0]]
    
    f_values = product.xpath('.//FVALUE/text()')
    row.extend(f_values)    
    rows.append(row)

df = pd.DataFrame(rows,columns=headers)

df
# df.to_csv("File2_Export_V1.csv", index=False)

This is the output I want:

    DESCRIPTION_SHORT       DESCRIPTION_LONG                                Colour  Legs    Happiness Levels
0   green cat w short hair  green cat w short hair and unlimited zoomies    green   14      
1   blue dog w no tail      blue dog w no tail and unlimited zoomies        blue            11/10

My attempt at solving this problem was to extend one line like this:

headers=[elem.tag for elem in xml_data.xpath('//HEADER[1]//PRODUCT_DETAILS//*'),('//HEADER[2]//PRODUCT_DETAILS//*')]

Sadly, I get a syntax error and no solution.

How do I adapt my code to reflect the xml structure?

Thank you in advance! ~C

Charles
  • 57
  • 6
  • Hello. Please word what output you are looking for, eg: I want to loop over all children of the `HEADER`, check the child element, and output something. – Mahrkeenerh Mar 31 '22 at 10:10
  • Hi @Mahrkeenerh, I want to output a csv listing both the product green cat and blue dog with their respective tags e.g., DESCRIPTION_SHORT, DESCRIPTION_LONG, Colour, etc. I do not want to edit the xml. Apologies for my vague formulation, I'm still very new to coding and just beginning to understand the lingo. – Charles Mar 31 '22 at 10:38

1 Answers1

1

Probably not the best solution, but I think it's pretty straight-forward, and clear.

import xml.etree.ElementTree as ET
import pandas as pd

# Get xml object
tree = ET.parse('file2.xml')
root = tree.getroot()

# Create final DataFrame
out = pd.DataFrame()

# Loop over all products (Product = (DETAILS, FEATURES))
for i in range(0, len(root), 2):
    # Get all descriptions
    descriptions = [(child.tag, child.text) for child in root[i]]
    # Get all features
    features = [(child[0].text, child[1].text) for child in root[i + 1]]

    # Create a DataFrame, where columns are the tags, and values are, well, values
    temp_df = pd.DataFrame([[i[1] for i in descriptions + features]], columns=[i[0] for i in descriptions + features])

    # Append to final DataFrame
    out = pd.concat([out, temp_df])
Mahrkeenerh
  • 1,104
  • 1
  • 9
  • 25
  • Hi, thanks for taking the time to write up the code! When I output temp_df, I only receive the output for one of the products, namely the blue dog. Do I need to tweak anything else? – Charles Mar 31 '22 at 11:30
  • 1
    `temp_df` creates a new `DataFrame` for each new product. This is done, because you can have different column names. The final output is in `out` `DataFrame` – Mahrkeenerh Mar 31 '22 at 11:35
  • Understood. If I wanted to have all of the products in one data frame (column empty when no value for text) how would I go about that? [See my desired output columns legs and happiness level] – Charles Mar 31 '22 at 11:40
  • 1
    [replace_nan_with_empty](https://stackoverflow.com/questions/26837998/pandas-replace-nan-with-blank-empty-string) – Mahrkeenerh Mar 31 '22 at 11:41