
I am using an XSL file to merge multiple XML files. The number of files is around 100, and each file has 4000 nodes. The example XML and XSL are available here in this SO question.

My xmlmerge.py is as follows:

import lxml.etree as ET
import argparse
import os
ap = argparse.ArgumentParser()
ap.add_argument("-x", "--xmlreffile", required=True, help="Path to list of xmls")
ap.add_argument("-s", "--xslfile", required=True, help="Path to the xslfile")
args = vars(ap.parse_args())    
dom = ET.parse(args["xmlreffile"])
xslt = ET.parse(args["xslfile"])
transform = ET.XSLT(xslt)
newdom = transform(dom)
print(ET.tostring(newdom, pretty_print=True))   

I am writing the output of the Python script to an XML file, so the command I use to run the script is as follows:

python xmlmerge.py --xmlreffile ~/Documents/listofxmls.xml --xslfile ~/Documents/xslfile.xsl

For 100 files, when I print the output to the console, it takes around 120 minutes. However, if I try to save the same output to an XML file:

python xmlmerge.py --xmlreffile ~/Documents/listofxmls.xml --xslfile ~/Documents/xslfile.xsl >> ~/Documents/mergedxml.xml

it has been running for around 3 days and the merge is still not finished. I was not sure whether the machine had hung, so I tried with just 8 files on a different machine; that run has taken more than 4 hours and the merge is still not complete. I don't know why it takes so much time when I write to a file but not when I print to the console. Can someone guide me?

I am using Ubuntu 14.04 and Python 2.7.
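
In case it matters, doing the write from inside the script instead of redirecting stdout would look roughly like this; a minimal sketch based on the script above, with an illustrative output path:

import lxml.etree as ET

dom = ET.parse("listofxmls.xml")   # same reference file as above
xslt = ET.parse("xslfile.xsl")
transform = ET.XSLT(xslt)
newdom = transform(dom)

# serialize once and write the bytes straight to a file (path is illustrative)
with open("mergedxml.xml", "wb") as out:
    out.write(ET.tostring(newdom, pretty_print=True))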


1 Answer


Why don't you make a multiprocessing version of your script? There are several ways you could do it, but from what I understand, here is what I would do:

import lxml.etree as ET
from multiprocessing.dummy import Pool as ThreadPool

# assuming this gives you a list of XML file paths, one per line (adapt if necessary)
with open("listofxmls.xml", "r") as f:
    xml_files = [line.strip() for line in f if line.strip()]

def yourFunction(xml_path):
    # the steps of your parse/transform, applied to a single file
    dom = ET.parse(xml_path)
    xslt = ET.parse("xslfile.xsl")   # adapt the stylesheet path to your arguments
    transform = ET.XSLT(xslt)
    newdom = transform(dom)
    return ET.tostring(newdom, pretty_print=True)

pool = ThreadPool(4)  # number of threads (adapt depending on the task and your CPU)
mergedXML = pool.map(yourFunction, xml_files)  # execute the function in parallel
pool.close()
pool.join()

Then, save your mergedXML as you like.
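
For example, a minimal sketch of that last step, assuming mergedXML is the list of byte strings returned by pool.map above and that you simply want them concatenated into one file (the output path is illustrative):

with open("mergedxml.xml", "wb") as out:
    for chunk in mergedXML:
        out.write(chunk)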

Hope it helps or, at least, leads you in the right direction.
