I'm trying to get the best performance for building a large XML file in Python 2/Django.
The final XML file is ~500 MB. My first approach used lxml, but it took over 3.5 hours. I then tested xml.sax (XMLGenerator), and it took about the same amount of time, 3.5 hours.
I'm trying to find the fastest way with the least memory consumption. I searched for several days to find the best solutions but had no success.
lxml code:
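import datetime

from lxml import etree

tree_var = etree.Element("tree_var", version='1.2')
# DATETIME's definition was trimmed from this excerpt; assumed to be a child of the root
DATETIME = etree.SubElement(tree_var, "DATETIME")
DATE = etree.SubElement(DATETIME, "DATE")
DATE.text = datetime.date.today().strftime('%Y-%m-%d')

# Full model instances are needed because product.state is read below
products = FromModel.objects.all()
for product in products:
    if product.state == 'new':
        ARTICLE = etree.SubElement(tree_var, "ARTICLE", mode=product.state)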
XMLGenerator code:
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesNSImpl

with open("tmp/" + filename + ".xml", 'wb') as out:
    g = XMLGenerator(out, encoding='utf-8')
    g.startDocument()

    def start_tag(name, attr=None, body=None, namespace=None):
        # Build the two mappings AttributesNSImpl expects:
        # (namespace, key) -> value and (namespace, key) -> qualified name
        attr = attr or {}
        attr_vals = {}
        attr_keys = {}
        for key, val in attr.iteritems():
            key_tuple = (namespace, key)
            attr_vals[key_tuple] = val
            attr_keys[key_tuple] = key
        attr2 = AttributesNSImpl(attr_vals, attr_keys)
        g.startElementNS((namespace, name), name, attr2)
        if body:
            g.characters(body)

    def end_tag(name, namespace=None):
        g.endElementNS((namespace, name), name)

    def tag(name, attr=None, body=None, namespace=None):
        # Write a complete element: start tag, optional text, end tag
        start_tag(name, attr, body, namespace)
        end_tag(name, namespace)

    # ... calls to tag() / start_tag() / end_tag() for each product go here ...

    g.endDocument()
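For context, here is a minimal, self-contained sketch of how these helpers get called. It is not my real loop: the list of states, the `write_articles` name, and the `io.BytesIO` target are made up for illustration (the real code iterates over the Django queryset and writes to a file). It uses `.items()` instead of `.iteritems()` so that it should run on both Python 2.7 and Python 3.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesNSImpl


def write_articles(out, states):
    """Stream one ARTICLE element per 'new' state into the binary file-like `out`."""
    g = XMLGenerator(out, encoding='utf-8')

    def start_tag(name, attr=None, body=None, namespace=None):
        attr = attr or {}
        attr_vals = {}
        attr_keys = {}
        for key, val in attr.items():
            key_tuple = (namespace, key)
            attr_vals[key_tuple] = val
            attr_keys[key_tuple] = key
        g.startElementNS((namespace, name), name,
                         AttributesNSImpl(attr_vals, attr_keys))
        if body:
            g.characters(body)

    def end_tag(name, namespace=None):
        g.endElementNS((namespace, name), name)

    g.startDocument()
    start_tag(u'tree_var', {u'version': u'1.2'})
    for state in states:  # hypothetical stand-in for the queryset loop
        if state == u'new':
            start_tag(u'ARTICLE', {u'mode': state})
            end_tag(u'ARTICLE')
    end_tag(u'tree_var')
    g.endDocument()


buf = io.BytesIO()  # stand-in for the real output file
write_articles(buf, [u'new', u'old', u'new'])
xml_bytes = buf.getvalue()
```

Only the two `new` entries produce an `ARTICLE` element; nothing is kept in memory beyond the element currently being written.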
I'm pretty sure that xml.sax uses less memory, and it grows the file in real time as elements are written. lxml, on the other hand, only writes the file at the end of the loop, after holding the whole tree in a huge in-memory buffer.
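To make that observation concrete, this is a small sketch of what I believe is happening. It is an illustration under assumptions, not a measurement of my real job: it uses `xml.etree.ElementTree` from the standard library as a stand-in for lxml (I am assuming lxml's `Element`/`SubElement` tree building behaves the same way in this respect), in-memory `io.BytesIO` buffers instead of real files, and a made-up element count `N`.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

N = 1000  # hypothetical number of ARTICLE elements

# Streaming with XMLGenerator: output grows as each element is generated.
stream_out = io.BytesIO()
g = XMLGenerator(stream_out, encoding='utf-8')
g.startDocument()
g.startElement(u'tree_var', AttributesImpl({u'version': u'1.2'}))
sizes = []  # output size in bytes after each ARTICLE
for i in range(N):
    g.startElement(u'ARTICLE', AttributesImpl({u'mode': u'new'}))
    g.endElement(u'ARTICLE')
    sizes.append(stream_out.tell())
g.endElement(u'tree_var')
g.endDocument()

# Tree building (ElementTree as a stand-in for lxml): the whole tree lives
# in memory and nothing reaches the output until write() is called.
tree_out = io.BytesIO()
root = ET.Element('tree_var', version='1.2')
for i in range(N):
    ET.SubElement(root, 'ARTICLE', mode='new')
size_before_write = tree_out.tell()
ET.ElementTree(root).write(tree_out, encoding='utf-8')
size_after_write = tree_out.tell()
```

In the streaming half, `sizes` increases with every element, while in the tree half `size_before_write` is still zero after the loop has finished. That matches what I see on disk, but it does not explain why both approaches take the same 3.5 hours.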
Any ideas?
Thanks!