Optimizing XML parse into CSV using Python

Question

I have about 10,000 XML files with a similar structure that I want to convert to a single CSV file. Each XML file looks like this:

<?xml version='1.0' encoding='UTF-8'?>
<S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/">
    <S:Body>
        <ns7:GetStopMonitoringServiceResponse xmlns:ns3="http://www.siri.org.uk/siri" xmlns:ns4="http://www.ifopt.org.uk/acsb" xmlns:ns5="http://www.ifopt.org.uk/ifopt" xmlns:ns6="http://datex2.eu/schema/1_0/1_0" xmlns:ns7="http://new.webservice.namespace">
            <Answer>
                <ns3:ResponseTimestamp>2019-03-31T09:00:52.912+03:00</ns3:ResponseTimestamp>
                <ns3:ProducerRef>ISR Siri Server (141.10)</ns3:ProducerRef>
                <ns3:ResponseMessageIdentifier>276480603</ns3:ResponseMessageIdentifier>
                <ns3:RequestMessageRef>0100700:1351669188:4684</ns3:RequestMessageRef>
                <ns3:Status>true</ns3:Status>
                <ns3:StopMonitoringDelivery version="IL2.71">
                    <ns3:ResponseTimestamp>2019-03-31T09:00:52.912+03:00</ns3:ResponseTimestamp>
                    <ns3:Status>true</ns3:Status>
                    <ns3:MonitoredStopVisit>
                        <ns3:RecordedAtTime>2019-03-31T09:00:52.000+03:00</ns3:RecordedAtTime>
                        <ns3:ItemIdentifier>-881202701</ns3:ItemIdentifier>
                        <ns3:MonitoringRef>20902</ns3:MonitoringRef>
                        <ns3:MonitoredVehicleJourney>
                            <ns3:LineRef>23925</ns3:LineRef>
                            <ns3:DirectionRef>2</ns3:DirectionRef>
                            <ns3:FramedVehicleJourneyRef>
                                <ns3:DataFrameRef>2019-03-31</ns3:DataFrameRef>
                                <ns3:DatedVehicleJourneyRef>36962685</ns3:DatedVehicleJourneyRef>
                            </ns3:FramedVehicleJourneyRef>
                            <ns3:PublishedLineName>15</ns3:PublishedLineName>
                            <ns3:OperatorRef>15</ns3:OperatorRef>
                            <ns3:DestinationRef>26020</ns3:DestinationRef>
                            <ns3:OriginAimedDepartureTime>2019-03-31T08:35:00.000+03:00</ns3:OriginAimedDepartureTime>
                            <ns3:VehicleLocation>
                                <ns3:Longitude>34.78000259399414</ns3:Longitude>
                                <ns3:Latitude>32.042293548583984</ns3:Latitude>
                            </ns3:VehicleLocation>
                            <ns3:VehicleRef>37629301</ns3:VehicleRef>
                            <ns3:MonitoredCall>
                                <ns3:StopPointRef>20902</ns3:StopPointRef>
                                <ns3:ExpectedArrivalTime>2019-03-31T09:03:00.000+03:00</ns3:ExpectedArrivalTime>
                            </ns3:MonitoredCall>
                        </ns3:MonitoredVehicleJourney>
                    </ns3:MonitoredStopVisit>
                </ns3:StopMonitoringDelivery>
            </Answer>
        </ns7:GetStopMonitoringServiceResponse>
    </S:Body>
</S:Envelope>

The example above shows one nested MonitoredStopVisit tag, but every XML file has about 4,000 of them. A full XML example can be found here.

I want to convert all the 10K files to one CSV where each record corresponds to a MonitoredStopVisit tag, so the CSV should look like this:

Currently this is my architecture:

Split the 10K files into 8 chunks (one per core on my PC).
Each sub-process iterates through its xml files and objectifies the xml.
The object is then iterated, and per each element I use conditions to exclude/include data using an array.
When the tag is /ns3:MonitoredStopVisit, the array is appended to a pandas dataframe as a series.
When all sub-processes are done, the dataframes are merged and saved as CSV.
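
The chunk-per-core layout in the steps above could be sketched roughly as follows, using only the standard library. The names `split_into_chunks`, `parse_one`, `process_chunk`, and `convert` are hypothetical, and `parse_one` is only a placeholder for the real XML parsing. Returning plain lists of rows from each worker (instead of DataFrames) is an assumption of this sketch.

```python
import csv
from multiprocessing import Pool


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks lists of near-equal size (round-robin)."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def parse_one(path):
    """Hypothetical placeholder: return the rows parsed from one XML file."""
    return [[path, "placeholder"]]


def process_chunk(paths):
    """Worker: parse every file in the chunk and collect rows in a list."""
    rows = []
    for path in paths:
        rows.extend(parse_one(path))
    return rows


def convert(paths, out_csv, n_procs=8):
    """Parse the chunks in parallel, then write all rows to one CSV file."""
    chunks = split_into_chunks(paths, n_procs)
    with Pool(processes=n_procs) as pool:
        results = pool.map(process_chunk, chunks)
    with open(out_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        for rows in results:
            writer.writerows(rows)
```

On platforms that start worker processes by spawning, a call such as `convert(paths, "merged.csv")` needs to sit under an `if __name__ == "__main__":` guard.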

This is the xml to df code:

def xml_to_df(xml_file):
    from lxml import objectify
    xml_content = xml_file.read()
    obj = objectify.fromstring(xml_content)
    df_cols=[
        'RecordedAtTime',
        'MonitoringRef',
        'LineRef',
        'DirectionRef',
        'PublishedLineName',
        'OperatorRef',
        'DestinationRef',
        'OriginAimedDepartureTime',
        'Longitude',
        'Latitude',
        'VehicleRef',
        'StopPointRef',
        'ExpectedArrivalTime',
        'AimedArrivalTime'
        ]
    tempdf = pd.DataFrame(columns=df_cols)
    arr_of_vals = [""] * 14

    for i in obj.getiterator():
        if "MonitoredStopVisit" in i.tag or "Status" in i.tag and "false" in str(i):
            if arr_of_vals[0] != "" and (arr_of_vals[8] and arr_of_vals[9]):
                s = pd.Series(arr_of_vals, index=df_cols)
                if tempdf[(tempdf==s).all(axis=1)].empty:
                    tempdf = tempdf.append(s, ignore_index=True)
                    arr_of_vals =  [""] * 14
        elif "RecordedAtTime" in i.tag:
            arr_of_vals[0] = str(i)
        elif "MonitoringRef" in i.tag:
            arr_of_vals[1] = str(i)
        elif "LineRef" in i.tag:
            arr_of_vals[2] = str(i)
        elif "DestinationRef" in i.tag:
            arr_of_vals[6] = str(i)
        elif "OriginAimedDepartureTime" in i.tag:
            arr_of_vals[7] = str(i)
        elif "Longitude" in i.tag:
            if str(i) == "345353":
                print("Lon: " + str(i))
            arr_of_vals[8] = str(i)
        elif "Latitude" in i.tag:
            arr_of_vals[9] = str(i)
        elif "VehicleRef" in i.tag:
            arr_of_vals[10] = str(i)
        elif "ExpectedArrivalTime" in i.tag:
            arr_of_vals[12] = str(i)

    if arr_of_vals[0] != "" and (arr_of_vals[8] and arr_of_vals[9]):  
        s = pd.Series(arr_of_vals, index=df_cols)
        if tempdf[(tempdf == s).all(axis=1)].empty:
            tempdf = tempdf.append(s, ignore_index=True)
    return tempdf

The problem is that for 10K files this takes about 10 hours with 8 sub-processes. When checking CPU and memory usage, I can see that they are not fully utilized.

Any idea how this can be improved? My next step is threading, but maybe there are other applicable ways. Just as a note, the order of records isn't important - I can sort it later.

Added a link to a full XML: https://wetransfer.com/downloads/ea6f5a37252cd000ec7e90096112217020190604125623/2a72f4 — Shakedk, Jun 04 '19 at 12:58

score 1 · Answer 1 · answered Jun 05 '19 at 07:51

Here is my solution with pandas:

Computation time for each 5 MB file is about 0.4 s.

import xml.etree.ElementTree as ET
import re
import pandas as pd
import os



def collect_data(xml_file):
    # create xml object
    root = ET.parse(xml_file).getroot()

    # collect raw data
    out_data = []
    for element in root.iter():
        # get tag name
        tag = re.sub('{.*?}', '', element.tag)
        # add break segment element
        if tag == 'RecordedAtTime':
            out_data.append('break')

        if tag in tag_list:
            out_data.append((tag, element.text))

    # get break indexes, plus an end sentinel so the last record is not dropped
    break_index = [i for i, x in enumerate(out_data) if x == 'break']
    break_index.append(len(out_data))

    # break list into parts, one part per record
    list_data = []
    for i in range(len(break_index) - 1):
        list_data.append(out_data[break_index[i]:break_index[i + 1]])

    # check for each value in data
    final_data = []
    for item in list_data:
        # delete break element and convert list into dictionary
        del item[item.index('break')]
        data_dictionary = dict(item)

        if 'RecordedAtTime' in data_dictionary.keys():
            recorded_at_time = data_dictionary.get('RecordedAtTime')
        else:
            recorded_at_time = ''

        if 'MonitoringRef' in data_dictionary.keys():
            monitoring_ref = data_dictionary.get('MonitoringRef')
        else:
            monitoring_ref = ''

        if 'LineRef' in data_dictionary.keys():
            line_ref = data_dictionary.get('LineRef')
        else:
            line_ref = ''

        if 'DirectionRef' in data_dictionary.keys():
            direction_ref = data_dictionary.get('DirectionRef')
        else:
            direction_ref = ''

        if 'PublishedLineName' in data_dictionary.keys():
            published_line_name = data_dictionary.get('PublishedLineName')
        else:
            published_line_name = ''

        if 'OperatorRef' in data_dictionary.keys():
            operator_ref = data_dictionary.get('OperatorRef')
        else:
            operator_ref = ''

        if 'DestinationRef' in data_dictionary.keys():
            destination_ref = data_dictionary.get('DestinationRef')
        else:
            destination_ref = ''

        if 'OriginAimedDepartureTime' in data_dictionary.keys():
            origin_aimed_departure_time = data_dictionary.get('OriginAimedDepartureTime')
        else:
            origin_aimed_departure_time = ''

        if 'Longitude' in data_dictionary.keys():
            longitude = data_dictionary.get('Longitude')
        else:
            longitude = ''

        if 'Latitude' in data_dictionary.keys():
            latitude = data_dictionary.get('Latitude')
        else:
            latitude = ''

        if 'VehicleRef' in data_dictionary.keys():
            vehicle_ref = data_dictionary.get('VehicleRef')
        else:
            vehicle_ref = ''

        if 'StopPointRef' in data_dictionary.keys():
            stop_point_ref = data_dictionary.get('StopPointRef')
        else:
            stop_point_ref = ''

        if 'ExpectedArrivalTime' in data_dictionary.keys():
            expected_arrival_time = data_dictionary.get('ExpectedArrivalTime')
        else:
            expected_arrival_time = ''

        if 'AimedArrivalTime' in data_dictionary.keys():
            aimed_arrival_time = data_dictionary.get('AimedArrivalTime')
        else:
            aimed_arrival_time = ''

        final_data.append((recorded_at_time, monitoring_ref, line_ref, direction_ref, published_line_name, operator_ref,
                       destination_ref, origin_aimed_departure_time, longitude, latitude, vehicle_ref,
                       stop_point_ref,
                       expected_arrival_time, aimed_arrival_time))

    return final_data


# setup tags list for checking
tag_list = ['RecordedAtTime', 'MonitoringRef', 'LineRef', 'DirectionRef', 'PublishedLineName', 'OperatorRef',
            'DestinationRef', 'OriginAimedDepartureTime', 'Longitude', 'Latitude', 'VehicleRef', 'StopPointRef',
            'ExpectedArrivalTime', 'AimedArrivalTime']

# collect data from each file
save_data = []
for file_name in os.listdir(os.getcwd()):
    if file_name.endswith('.xml'):
        save_data.append(collect_data(file_name))
    else:
        pass

# merge list of lists
flat_list = []
for sublist in save_data:
    for item in sublist:
        flat_list.append(item)

# load data into data frame
data = pd.DataFrame(flat_list, columns=tag_list)

# save data to file
data.to_csv('data.csv', index=False)

Did you happen to try it for 10K files? Just out of curiosity — Shakedk, Jun 05 '19 at 08:14
@Shakedk I tested on 1,000 files and the average time for one file is 0.59 s. I think you can modify my code to make it faster. — Zaraki Kenpachi, Jun 05 '19 at 08:47
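
As a variation on the approach above (not the answerer's code), here is a minimal standard-library sketch that streams each file with `ET.iterparse` and reads the fields relative to each MonitoredStopVisit element. It avoids the 'break' markers and writes rows straight to CSV. The namespace URI is taken from the sample XML in the question. The names `SIRI`, `FIELDS`, `iter_rows`, and `write_csv` are hypothetical. The element paths assume the nesting shown in the sample.

```python
import csv
import xml.etree.ElementTree as ET

SIRI = "{http://www.siri.org.uk/siri}"  # namespace URI from the sample XML


def _path(*tags):
    """Build a namespace-qualified path such as '{uri}A/{uri}B'."""
    return "/".join(SIRI + tag for tag in tags)


J = "MonitoredVehicleJourney"
# column name -> path relative to a MonitoredStopVisit element
FIELDS = {
    "RecordedAtTime": _path("RecordedAtTime"),
    "MonitoringRef": _path("MonitoringRef"),
    "LineRef": _path(J, "LineRef"),
    "DirectionRef": _path(J, "DirectionRef"),
    "PublishedLineName": _path(J, "PublishedLineName"),
    "OperatorRef": _path(J, "OperatorRef"),
    "DestinationRef": _path(J, "DestinationRef"),
    "OriginAimedDepartureTime": _path(J, "OriginAimedDepartureTime"),
    "Longitude": _path(J, "VehicleLocation", "Longitude"),
    "Latitude": _path(J, "VehicleLocation", "Latitude"),
    "VehicleRef": _path(J, "VehicleRef"),
    "StopPointRef": _path(J, "MonitoredCall", "StopPointRef"),
    "ExpectedArrivalTime": _path(J, "MonitoredCall", "ExpectedArrivalTime"),
    "AimedArrivalTime": _path(J, "MonitoredCall", "AimedArrivalTime"),
}


def iter_rows(source):
    """Yield one list of field values per MonitoredStopVisit element.

    source is a file name or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == SIRI + "MonitoredStopVisit":
            # missing elements become empty strings
            yield [elem.findtext(p, default="") for p in FIELDS.values()]
            elem.clear()  # release the subtree once its row is emitted


def write_csv(sources, out_file):
    """Write a header and all rows from the given sources to out_file."""
    writer = csv.writer(out_file)
    writer.writerow(list(FIELDS))
    for source in sources:
        writer.writerows(iter_rows(source))
```

A possible call would be `write_csv(xml_paths, open("data.csv", "w", newline=""))`, where `xml_paths` is whatever list of file names you collect. Whether this is faster than the code above on the real 5 MB files would need to be measured.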

Shakedk · Answer 2 · 2019-06-05T07:41:49.780

0

So it seems the issue is the use of the Pandas dataframe and series. Using the code above, processing one xml file with ~4000 records took 4-120 seconds. The time increased as the program kept working.

Using Python lists or NumPy arrays (more convenient for writing to a CSV) decreased the running time significantly: processing each XML file now takes 0.1-0.5 seconds at most.

I used the following code to append the new processed records each time

records = np.append(records, new_records, axis=0)

This is equivalent to:

tempdf = tempdf.append(s, ignore_index=True)

but significantly faster.
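
Note that `np.append` also returns a new array on every call, so it still copies the accumulated data each time. A minimal sketch of an alternative is to gather rows in a plain Python list and serialize once at the end. The names `collect_records` and `to_csv_text` and the sample values are hypothetical, and this is not the code I actually ran.

```python
import csv
import io


def collect_records(batches):
    """Accumulate rows from many batches in a single Python list."""
    records = []
    for new_records in batches:      # e.g. one batch per parsed XML file
        records.extend(new_records)  # appends in place, no full copy
    return records


def to_csv_text(records, header):
    """Serialize the collected rows once, after all batches are gathered."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(records)
    return buf.getvalue()


batches = [[["t1", "34.78"]], [["t2", "34.79"], ["t3", "34.80"]]]
records = collect_records(batches)
print(to_csv_text(records, ["RecordedAtTime", "Longitude"]))
```

If a DataFrame is still wanted, the same list can be passed to `pd.DataFrame(records, columns=...)` once, after the loop.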

Hope this helps anyone who might encounter similar issues!

edited Jun 05 '19 at 07:41

answered Jun 04 '19 at 20:26

Shakedk


Of course the run time of your Pandas code increases as it goes, since you are [appending a data frame in a loop](https://stackoverflow.com/a/36489724/1422451), which you should avoid because it leads to quadratic copying. – Parfait Jun 04 '19 at 21:47
I'm doing the same with numpy arrays: 'bs_records = np.append(bs_records, records, axis=0)' and it is faster by far as I mentioned... Maybe for some people it's obvious, but for people not used to pandas/numpy this can be helpful to know. – Shakedk Jun 05 '19 at 07:31
Look into XSLT which can transform XML to CSV. No need for appending lists, matrices, or data frames. – Parfait Jun 05 '19 at 16:33

score 0 · Answer 3 · answered Jun 05 '19 at 16:31

Consider XSLT, the special-purpose language designed to transform XML files into other XML or even text files such as CSV. The only third-party library needed is Python's lxml, which can run XSLT 1.0 scripts, so the heavier analytical tools such as Pandas and NumPy can be left out.

In fact, because XSLT is a separate, industry-standard language, it is portable and can be run from any language with an XSLT library (e.g., Java, PHP, Perl, C#, VB) or by standalone 1.0, 2.0, or 3.0 processors (e.g., Xalan, Saxon), all of which Python can call as a command-line subprocess.

XSLT (save below as a .xsl file, a special .xml file)

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                              xmlns:S="http://schemas.xmlsoap.org/soap/envelope/"
                              xmlns:ns3="http://www.siri.org.uk/siri" 
                              xmlns:ns4="http://www.ifopt.org.uk/acsb" 
                              xmlns:ns5="http://www.ifopt.org.uk/ifopt" 
                              xmlns:ns6="http://datex2.eu/schema/1_0/1_0" 
                              xmlns:ns7="http://new.webservice.namespace">

   <xsl:output method="text" indent="yes" omit-xml-declaration="yes"/>
   <xsl:strip-space elements="*"/>

   <xsl:template match ="/S:Envelope/S:Body/ns7:GetStopMonitoringServiceResponse/Answer">
       <xsl:apply-templates select="ns3:StopMonitoringDelivery"/>
   </xsl:template>

   <xsl:template match="ns3:StopMonitoringDelivery">
        <!-- HEADERS -->
        <!-- <xsl:text>RecordedAtTime,MonitoringRef,LineRef,DirectionRef,PublishedLineName,OperatorRef,DestinationRef,OriginAimedDepartureTime,Longitude,Latitude,VehicleRef,StopPointRef,ExpectedArrivalTime,AimedArrivalTime&#xa;</xsl:text> -->
        <xsl:apply-templates select="ns3:MonitoredStopVisit"/>
   </xsl:template>

   <xsl:template match="ns3:MonitoredStopVisit">
       <xsl:variable name="delim">,</xsl:variable>
       <xsl:variable name="quote">&quot;</xsl:variable>
       <!-- DATA ROWS: one line per MonitoredStopVisit, no trailing delimiter -->
       <xsl:value-of select="concat($quote, ns3:RecordedAtTime, $quote, $delim,
                                    $quote, ns3:MonitoringRef, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:LineRef, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:DirectionRef, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:PublishedLineName, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:OperatorRef, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:DestinationRef, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:OriginAimedDepartureTime, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:VehicleLocation/ns3:Longitude, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:VehicleLocation/ns3:Latitude, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:VehicleRef, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:MonitoredCall/ns3:StopPointRef, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:MonitoredCall/ns3:ExpectedArrivalTime, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:MonitoredCall/ns3:AimedArrivalTime, $quote
                                    )"/>
       <xsl:text>&#xa;</xsl:text>
   </xsl:template>

</xsl:stylesheet>

Online Demo

Python (no appending lists, arrays, or dataframes)

import glob                 # TO RETRIEVE ALL XML FILES
import lxml.etree as et     # TO PARSE XML AND RUN XSLT

xml_path = "/path/to/xml/files"

# PARSE XSLT AND BUILD THE TRANSFORMER ONCE
xsl = et.parse('XSLTScript.xsl')
transform = et.XSLT(xsl)

# BUILD CSV
with open("MonitoredStopVisits.csv", 'w') as f:
    # HEADER
    f.write('RecordedAtTime,MonitoringRef,LineRef,DirectionRef,PublishedLineName,'
            'OperatorRef,DestinationRef,OriginAimedDepartureTime,Longitude,Latitude,'
            'VehicleRef,StopPointRef,ExpectedArrivalTime,AimedArrivalTime\n')

    # DATA ROWS (loop variable must not shadow the open CSV handle f)
    for xml_file in glob.glob(xml_path + "/**/*.xml", recursive=True):
        # LOAD XML
        xml = et.parse(xml_file)

        # TRANSFORM XML TO STRING RESULT TREE
        result = str(transform(xml))

        # WRITE TO CSV
        f.write(result)
