
I'm iterating through a large number of XML files, each with ~1000 individual nodes, to extract a specific attribute from every node (each node has 15 or so attributes, but I only want one of them). In the end there should be about 4 million rows. My code is below, but I have a feeling it's not time-efficient. What can I optimize here?

```python
import os
import pandas as pd
import xml.etree.ElementTree as xml

# init master df as accumulator of temp dfs
master_df = pd.DataFrame(
    columns=['col1', 'col2', 'col3', 'col4'])
src_dir = 'C:\\somedir'

# iterate through files
for file in os.listdir(src_dir):
    # parse the xml (ElementTree.parse accepts a path directly)
    parse = xml.parse(os.path.join(src_dir, file))
    root = parse.getroot()

    # var assignments with desired data from the parent nodes
    parent_node1 = str(root[0][0].get('pn1'))
    parent_node2 = str(root[0][1].get('pn2'))

    # reset iteration-dependent variables for each file
    count = 0
    a_dict = {}

    # iterate through the child nodes, skipping the container element itself
    for i in list(root[1].iter())[1:]:
        child_node1 = str(i.get('cn1'))
        child_node2 = str(i.get('cn2'))
        a_dict[count] = {
            'col1': parent_node1,
            'col2': child_node1,
            'col3': parent_node2,
            'col4': child_node2,
        }
        count += 1

    # build a per-file frame and fold it into the accumulator
    temp_df = pd.DataFrame(a_dict).T
    master_df = pd.merge(
        left=master_df,
        right=temp_df,
        how='outer',
    )
```
  • @mzjn when I iterate over 20 XMLs it takes 5 min; when I moved it out of the testing env to iterate over 150, it takes 1 hr+ – Carter Canedy May 10 '22 at 19:45
  • Sorry that I can't be more specific. I think it probably has to do with all of the comparison operations being done for each merge. I've been trying to get NVIDIA RAPIDS/cuDF working in WSL2 to parallelize this, but no luck getting a usable environment. – Carter Canedy May 10 '22 at 19:52
  • Please show sample XML. You could be missing a golden opportunity to use the new IO method [**`pandas.read_xml`**](https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html)! (see the sketch after these comments) – Parfait May 10 '22 at 23:03
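
For reference, here is a rough sketch of what Parfait's `pandas.read_xml` suggestion could look like. Since no sample XML was posted, the `xpath` expression and the attribute names (`cn1`, `cn2`) below are placeholders carried over from the question's code, not a confirmed layout:

```python
import os
import pandas as pd

src_dir = 'C:\\somedir'
frames = []

for file in os.listdir(src_dir):
    # read_xml turns every node matched by xpath into a row;
    # './/child' is a hypothetical selector for the repeating child nodes
    df = pd.read_xml(
        os.path.join(src_dir, file),
        xpath='.//child',
        attrs_only=True,  # collect element attributes rather than text
    )
    frames.append(df[['cn1', 'cn2']])

# one concat at the end instead of a merge per file
master_df = pd.concat(frames, ignore_index=True)
```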

1 Answer


Instead of initializing intermediate DataFrames that are constantly being merged, I used nested lists, which are much faster under the hood, and since I'm not expecting to handle any irregular data sets this should be fine. Otherwise, all of the XML-parsing code is the same.
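
A minimal sketch of that change, keeping the question's parsing logic as-is (the node layout and attribute names are carried over from the question):

```python
import os
import pandas as pd
import xml.etree.ElementTree as xml

src_dir = 'C:\\somedir'
rows = []  # accumulate plain lists; build one DataFrame at the very end

for file in os.listdir(src_dir):
    root = xml.parse(os.path.join(src_dir, file)).getroot()
    parent_node1 = str(root[0][0].get('pn1'))
    parent_node2 = str(root[0][1].get('pn2'))

    # one row per child node; list.append is O(1) amortized,
    # unlike a DataFrame merge, which recopies the accumulator every pass
    for i in list(root[1].iter())[1:]:
        rows.append([parent_node1, str(i.get('cn1')),
                     parent_node2, str(i.get('cn2'))])

# a single DataFrame construction replaces the per-file merges
master_df = pd.DataFrame(rows, columns=['col1', 'col2', 'col3', 'col4'])
```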

  • Indeed. Never call `append`, `merge`, `concat` or other growth operations *inside* a for-loop. This leads to [quadratic copying](https://stackoverflow.com/a/36489724/1422451). – Parfait May 10 '22 at 22:59