Merging Lots of XML files

Question

I have lots of XML files that I need to merge. I tried the approach from "merging xml files using python's ElementTree"; my code, edited to fit my needs, is:

import os, os.path, sys
import glob
from xml.etree import ElementTree

def run(files):
    xml_files = glob.glob(files +"/*.xml")
    xml_element_tree = None
    for xml_file in xml_files:
        print xml_file
        data = ElementTree.parse(xml_file).getroot()
        # print ElementTree.tostring(data)
        for result in data.iter('TALLYMESSAGE'):
            if xml_element_tree is None:
                xml_element_tree = data 
                insertion_point = xml_element_tree.findall("./BODY/DATA/TALLYMESSAGE")[0]
            else:
                insertion_point.extend(result) 
    if xml_element_tree is not None:
        f =  open("myxmlfile.xml", "wb")
        f.write(ElementTree.tostring(xml_element_tree))
run("F:/data/data")

But the problem is that I have a lot of XML files, 365 to be precise, and each one is at least 2 MB. Merging them all crashes my PC. This is the image of the XML tree of my files:

My updated code is:

import os, os.path, sys
import glob
from lxml import etree
def XSLFILE(files):
    xml_files = glob.glob(files +"/*.xml")
    #print xml_files[0]
    xslstring = """<?xml version="1.0" ?> 
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"> 
<xsl:template match="/DATA">
<DATA>
<xsl:copy>
<xsl:copy-of select="TALLYMESSAGE"/>\n"""
    #print xslstring
    for xmlfile in xml_files[1:]:
        xslstring = xslstring + '<xsl:copy-of select="document(\'' + xmlfile[-16:] + "')/BODY/DATA/TALLYMESSAGE\"/>\n"
    xslstring = xslstring + """</xsl:copy>+
</DATA>
</xsl:template> 
</xsl:transform>"""
    #print xslstring
    with open("parsingxsl.xsl", "w") as f:
        f.write(xslstring)
    with open(xml_files[0], "r") as f:
        dom = etree.XML(f.read())
    print etree.tostring(dom)
    with open('F:\data\parsingxsl.xsl', "r") as f:
        xslt_tree = etree.XML(f.read())
    print xslt_tree
    transform = etree.XSLT(xslt_tree)
    newdom = transform(dom)
    #print newdom
    tree_out = etree.tostring(newdom, encoding='UTF-8', pretty_print=True,  xml_declaration=True)
    print(tree_out)

    xmlfile = open('F:\data\OutputFile.xml','wb')
    xmlfile.write(tree_out)
    xmlfile.close()
XSLFILE("F:\data\data")

Running it produces the following error:

Traceback (most recent call last):
  File "F:\data\xmlmergexsl.py", line 38, in <module>
    XSLFILE("F:\data\data")
  File "F:\data\xmlmergexsl.py", line 36, in XSLFILE
    xmlfile.write(tree_out)
TypeError: must be string or buffer, not None

Parfait · Accepted Answer · 2015-09-19T18:10:01.483

2

Consider using XSLT and its document() function to merge the XML files. Python, like many general-purpose languages, has an XSLT processor available, here through the lxml module. For background, XSLT is a declarative language for transforming XML files into other formats and structures.

For your purposes, XSLT may be more efficient than building the merged file in Python code, since no lists, loops, or other Python objects are held in memory during processing beyond what the XSLT processor itself uses.

XSLT (to be saved externally as an .xsl file)

To avoid copying and pasting, consider generating this stylesheet with a short Python loop that writes one line for each of the 365 documents. Note that the first document is skipped, since it is the starting point used in the Python script below:

<?xml version="1.0" ?> 
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"> 

 <xsl:template match="DATA">
  <DATA>
    <xsl:copy> 
       <xsl:copy-of select="TALLYMESSAGE"/>
       <xsl:copy-of select="document('Document2.xml')/BODY/DATA/TALLYMESSAGE"/>
       <xsl:copy-of select="document('Document3.xml')/BODY/DATA/TALLYMESSAGE"/>
       <xsl:copy-of select="document('Document4.xml')/BODY/DATA/TALLYMESSAGE"/>
       ...
       <xsl:copy-of select="document('Document365.xml')/BODY/DATA/TALLYMESSAGE"/>             
    </xsl:copy>
  </DATA>
 </xsl:template> 

</xsl:transform>
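The "Python write to text file" step mentioned above could look roughly like the following. This is only a sketch: `build_merge_xslt` is a hypothetical helper name, it assumes the stylesheet is saved in the same directory as the XML files (so `document()` can use bare file names), and it assumes no file name contains an apostrophe.

```python
import os


def build_merge_xslt(xml_files):
    """Return stylesheet text that merges TALLYMESSAGE nodes from xml_files.

    The first file is skipped because it is the document the transform is
    applied to; every other file is pulled in with document().
    """
    lines = [
        '<?xml version="1.0" ?>',
        '<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">',
        ' <xsl:template match="DATA">',
        '  <DATA>',
        '    <xsl:copy>',
        '       <xsl:copy-of select="TALLYMESSAGE"/>',
    ]
    for path in xml_files[1:]:
        # Relative names in document() are resolved against the stylesheet,
        # so only the bare file name is written (same-directory assumption).
        name = os.path.basename(path)
        lines.append(
            "       <xsl:copy-of select=\"document('%s')/BODY/DATA/TALLYMESSAGE\"/>" % name
        )
    lines += [
        '    </xsl:copy>',
        '  </DATA>',
        ' </xsl:template>',
        '</xsl:transform>',
    ]
    return "\n".join(lines) + "\n"
```

The returned text can then be written to an .xsl file next to the XML files, for example with `open("file.xsl", "w").write(build_merge_xslt(sorted(glob.glob("*.xml"))))` after importing `glob`.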

Python (to be included in your overall script)

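import lxml.etree as ET

# Raw strings keep the backslashes in Windows paths literal
# (otherwise '\f' in '\file.xsl' is read as a form feed).
dom = ET.parse(r'C:\Path\To\XML\Document1.xml')
xslt = ET.parse(r'C:\Path\To\XSL\file.xsl')
transform = ET.XSLT(xslt)
newdom = transform(dom)

# Serialize the transformed tree to UTF-8 bytes
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
print(tree_out)

# Write the bytes out; the with block closes the file
with open(r'C:\Path\To\XML\OutputFile.xml', 'wb') as xmlfile:
    xmlfile.write(tree_out)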

edited Sep 19 '15 at 18:10

answered Sep 18 '15 at 14:33


This is the error I'm getting: Traceback (most recent call last): File "F:\data\xmlmergexsl.py", line 32, in XSLFILE("F:\data\data") File "F:\data\xmlmergexsl.py", line 30, in XSLFILE xmlfile.write(tree_out) TypeError: must be string or buffer, not None – Smit Sanghvi Sep 19 '15 at 05:17
See the updated XSLT, where a root node is explicitly declared. It would be helpful to see a sample XML file. Also, be sure all XML files are in the same directory. – Parfait Sep 19 '15 at 05:37
I have added an image of the structure of my XML file. Each of them has exactly this basic structure; the only differences are in the voucher element. – Smit Sanghvi Sep 19 '15 at 06:11
Try updated XSLT once more. Paths are consistent with your structure. – Parfait Sep 19 '15 at 18:10
The program ran at last, but my initial problem persists: the RAM fills up completely and then the console is force-closed. – Smit Sanghvi Sep 20 '15 at 05:50
Well, I have found a workaround to run the xsl file; apparently its output is just: – Smit Sanghvi Sep 20 '15 at 06:03
To read from a file, use [lxml's](http://lxml.de/parsing.html) `parse()` (no need to open and read the file yourself); for an embedded string, use lxml's `fromstring()`. Test RAM usage with a handful of XML files before running all 365. You might have to work in batches and then merge the merged files (i.e., merge 3 XMLs with 122 TallyMessage nodes, 6 XMLs w/ 61 nodes, 12 w/ 30). – Parfait Sep 20 '15 at 14:08
Alternatively, you could first run a smaller transform on the original Envelope-root XML files if they tend to be large, since the XSLT processor must read all content into memory. This precursor XSLT, looped over all 365 files in Python, just copies the TallyMessage node (one line inside the template match): ``. Then apply the document-merge XSLT to these smaller XML files (and, of course, remove `BODY` from the XPath). – Parfait Sep 20 '15 at 14:25
Batching seems to be the way. I wrote a script to remove all blank and useless tags, which makes the whole job easier and faster. Thanks, by the way. – Smit Sanghvi Sep 21 '15 at 04:40
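
The batching idea from the comments can be sketched as follows. Note that this swaps in the standard library's `xml.etree.ElementTree` instead of XSLT, purely to keep the example self-contained. The helper names (`batches`, `merge_batch`, `merge_in_batches`), the `partial_NNN.xml` file names, and the default batch size are all hypothetical, and the only structural assumption is that each file contains `TALLYMESSAGE` elements somewhere under its root.

```python
import os
import xml.etree.ElementTree as ET


def batches(items, size):
    """Yield consecutive slices of items with at most size entries each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def merge_batch(xml_files, out_path):
    """Copy every TALLYMESSAGE element from xml_files under one DATA root."""
    merged = ET.Element("DATA")
    for path in xml_files:
        root = ET.parse(path).getroot()
        # iter() searches at any depth, so this handles both the original
        # files and the intermediate files written by this function.
        merged.extend(list(root.iter("TALLYMESSAGE")))
    ET.ElementTree(merged).write(out_path, encoding="UTF-8", xml_declaration=True)
    return out_path


def merge_in_batches(xml_files, out_path, work_dir=".", size=30):
    """Merge the files in groups of size, then merge the partial results."""
    partials = []
    for index, group in enumerate(batches(xml_files, size)):
        partial_path = os.path.join(work_dir, "partial_%03d.xml" % index)
        partials.append(merge_batch(group, partial_path))
    return merge_batch(partials, out_path)
```

Because only the `TALLYMESSAGE` elements are carried into the intermediate files, everything else in the source files is dropped early. The final step still builds the complete merged tree in memory, so it is worth trying a small batch first, as suggested above.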
