Extract XML info in a df in Python

Question

I have a problem. In my csv file there is column in XML, like this:

ID       Name       Request

4223    Axery          <Type xmlns="http://data" 
                xmlns:i="http://www.rij3.instance"><Person><City> 
                <Nr>5050</Nr><Description>Big</Description> 
                <Date>2012-10-30T00:00:00Z</Date></City><Details><Name>London</Name> 
                <Account>5050</Account><Date>2019-07-07T00:00:00Z</Date>
                <Status>Open</Status></Details><..............[more info]>
            ....
            </Person></Type>




            
4001    Jix     <Type xmlns="http://data" 
                xmlns:i="http://www.rij3.instance"><Person><City> 
                <Nr>5024</Nr><Description>Big</Description> 
                <Date>2012-10-30T00:00:00Z</Date></City><Details><Name>London</Name> 
                <Account>5024</Account><Date>2019-07-07T00:00:00Z</Date>
                <Status>Open</Status></Details><..............[more info]>
                ....
                </Person></Type>

....

4067    AOe     <Type xmlns="http://data" 
                xmlns:i="http://www.rij3.instance"><Person><City> 
                <Nr>5011</Nr><Description>Big</Description> 
                <Date>2012-10-30T00:00:00Z</Date></City><Details><Name>London</Name> 
                <Account>5011</Account><Date>2019-07-07T00:00:00Z</Date>
                <Status>Open</Status></Details><..............[more info]>
                ....
                </Person></Type>

I want extract XML info. I use Pandas to read my csv file

df = pd.read_csv('my_file.csv', header=0, sep='|', error_bad_lines=False)

I want a final df like this:

**ID     Name   Type  Person City Nr    Description Date ........**

4223    Axery                     5050    Big      2012-10-30T00:00:00Z

Any suggestions? My idea was to work with only the XML columns and 'concat' the result.

Request:

<Type xmlns="http://data" 
 xmlns:i="http://www.rij3.instance">
 <Person>
   <City>
     <Nr>5050</Nr>
     <Description>Big</Description>
     <Date>2012-10-30T00:00:00Z</Date>
   </City>
   <Details>
     <Name>London</Name>
     <Account>5050</Account>
     <Date>2019-07-07T00:00:00Z</Date>
     <Status>Open</Status>
   </Details>
 </Person>
</Type>

I edited the grammar, casing, and formatting of your code to improve your question which was accepted by the moderators. You then rolled it back to the original revision, which in my opinion has low quality. I would suggest you to apply the edits from revision 2 again. — Muhammad Mohsin Khan, Feb 02 '22 at 08:23

score 1 · Answer 1 · answered Jan 19 '22 at 20:09

I see the solution (not knowing Pandas) as iterating rows from the input CSV and creating new columns based on the XML found in that row. To me, this is a row oriented process. I don't know dataframes well, but from what I gather they are column oriented^1.

Here's how I'd do this using Python standard csv module.

Read the CSV, gather all the data

read the input CSV with a DictReader which will convert a row to a dict of values keyed to the names in the CSV's header
iterate the rows:
1. parse the XML field and add the processed values with new keys to the row's dict
2. delete the XML field/key from the row
3. append the row to a list of all my new rows

Write a new CSV

After that, I'll just use the DictWriter to create a new CSV based on the fields I've read/created in the previous steps.

import csv
from xml.etree import ElementTree as ET

# Need this to find your XML elements/nodes by name
ns = {'xmlns': 'http://data'}

new_rows = []

with open('input.csv', newline='') as f:
    reader = csv.DictReader(f)
    
    for row in reader:
        # Parse XML and get City node
        xml_str = row['XML']
        tree = ET.fromstring(xml_str)
        city_node = tree.find('./xmlns:Person/xmlns:City', ns)

        # Get individual City values
        city_nr = city_node.find('xmlns:Nr', ns).text
        city_desc = city_node.find('xmlns:Description', ns).text

        row['City_Nr'] = city_nr
        row['City_Description'] = city_desc

        del(row['XML'])

        new_rows.append(row)


new_fieldnames = new_rows[0].keys()

with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=new_fieldnames)
    writer.writeheader()
    writer.writerows(new_rows)

I mocked up this CSV, input.csv:

ID,Name,XML
4223,Axery,"<Type xmlns=""http://data"" 
 xmlns:i=""http://www.rij3.instance"">
 <Person>
   <City>
     <Nr>5050</Nr>
     <Description>Big</Description>
     <Date>2012-10-30T00:00:00Z</Date>
   </City>
   <Details>
     <Name>London</Name>
     <Account>5050</Account>
     <Date>2019-07-07T00:00:00Z</Date>
     <Status>Open</Status>
   </Details>
 </Person>
</Type>
"

When I run my script on that, I get this for output.csv:

ID,Name,City_Nr,City_Description
4223,Axery,5050,Big

Extract XML info in a df in Python

1 Answers1

Read the CSV, gather all the data

Write a new CSV