
I'm trying to get the best performance for building a large XML file in Python 2/Django.

The final XML file is ~500 MB. The first approach I used was lxml, but it took over 3.5 hours. I also tested xml.sax (XMLGenerator) and it took about the same amount of time, roughly 3.5 hours.

I'm trying to find the fastest approach with the least memory consumption. I've searched for several days for a better solution, but without success.

lxml code:

import datetime

from lxml import etree

tree_var = etree.Element("tree_var", version='1.2')
DATETIME = etree.SubElement(tree_var, "DATETIME")
DATE = etree.SubElement(DATETIME, "DATE")
DATE.text = datetime.date.today().strftime('%Y-%m-%d')

# iterate over model instances so product.state is available
# (values_list('product_id') would only yield id tuples)
products = FromModel.objects.all()
for product in products:
    if product.state == 'new':
        ARTICLE = etree.SubElement(tree_var, "ARTICLE", mode=product.state)

XMLGenerator code:

from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesNSImpl

with open("tmp/" + filename + ".xml", 'wb') as out:

    g = XMLGenerator(out, encoding='utf-8')
    g.startDocument()

    def start_tag(name, attr={}, body=None, namespace=None):
        attr_vals = {}
        attr_keys = {}
        for key, val in attr.iteritems():
            key_tuple = (namespace, key)
            attr_vals[key_tuple] = val
            attr_keys[key_tuple] = key

        attr2 = AttributesNSImpl(attr_vals, attr_keys)
        g.startElementNS((namespace, name), name, attr2)
        if body:
            g.characters(body)

    def end_tag(name, namespace=None):
        g.endElementNS((namespace, name), name)

    def tag(name, attr={}, body=None, namespace=None):
        start_tag(name, attr, body, namespace)
        end_tag(name, namespace)

    # ... loop over the queryset here, calling start_tag()/tag()/end_tag() ...

    g.endDocument()

I'm pretty sure xml.sax uses less memory and grows the file in real time as it goes. lxml, on the other hand, only writes the file at the end of the loop, keeping the whole tree in a huge in-memory buffer.
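
For comparison, here is a minimal, untested sketch of a streaming lxml writer using `etree.xmlfile` (reusing the element names and the `FromModel` queryset from the snippet above), which flushes each element to disk as it is produced instead of keeping the whole tree in memory:

import datetime

from lxml import etree

# etree.xmlfile serializes elements as they are written,
# so the full tree never has to live in memory.
with etree.xmlfile('tmp/output.xml', encoding='utf-8') as xf:
    xf.write_declaration()
    with xf.element('tree_var', {'version': '1.2'}):
        with xf.element('DATETIME'):
            with xf.element('DATE'):
                xf.write(datetime.date.today().strftime('%Y-%m-%d'))
        # .iterator() keeps Django from caching the whole queryset
        for product in FromModel.objects.all().iterator():
            if product.state == 'new':
                article = etree.Element('ARTICLE', mode=product.state)
                xf.write(article)  # flushed to the file immediately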

Any ideas for help?

Thanks!

xampione
  • Take a look at this: https://docs.djangoproject.com/en/2.0/howto/outputting-csv/#streaming-large-csv-files. I know it is about CSV files, but you can adapt it for an XML file. – Reidel Mar 13 '18 at 12:46
  • Take a look at the `XMLGenerator` class. You can find an example of how to use it [here](http://www.xml.com/pub/a/2003/03/12/py-xml.html). – Josh Voigts Mar 13 '18 at 13:42
  • @Reidel: OK, I will check that one. Thanks. I updated my question with my code example. – xampione Mar 13 '18 at 14:35
  • Another resource that may be useful: https://d.cxcore.net/Python/Python_Cookbook_3rd_Edition.pdf; see `12.8. Performing Simple Parallel Programming` in `Chapter 12, Concurrency`. – Reidel Mar 13 '18 at 20:30
  • Thanks for the tip @Reidel. I wrote new code and improved from 3h30 to less than 1 hour (tested with 8 and 2 cores). Here it is: `import codecs from multiprocessing.dummy import Pool, cpu_count def do_work(products): def parallel(result=None): pool = Pool(cpu_count()-1) # to prevent GIL with codecs.open("filename.xml", 'w+', "utf-8") as fp: pool.map(do_work, loop_object) pool.close() pool.join() parallel()` I think the memory use can still be improved, but I'm happy with the time I saved creating the XML. – xampione Mar 20 '18 at 10:45

1 Answer


You may still be loading the whole input file into memory when reading. Try something along the lines of the snippet below for reading a file incrementally; this Stack Overflow question also provides a little more context. You can also specify a target element in `iterparse()`, so that each complete target element can be processed and dumped at once (a short sketch of that variant follows the snippet).

from lxml import etree

# start_tag() / end_tag() are the XMLGenerator helpers from the question,
# so the input is read and the output written at the same time.
for event, elem in etree.iterparse(in_file, events=('start', 'end',)):
    if event == 'start':
        start_tag(elem.tag, {}, elem.text)
    elif event == 'end':
        end_tag(elem.tag)

        # It's safe to call clear() on the 'end' event because no
        # descendants will be accessed afterwards
        elem.clear()

        # Also eliminate now-empty references from the root node to elem
        while elem.getprevious() is not None:
            del elem.getparent()[0]
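
For instance, a minimal sketch of the target-element variant, assuming the repeating element in the input is called ARTICLE and that `handle_article()` stands in for whatever per-element writing you do (both names are placeholders, not taken from the question):

from lxml import etree

# Only 'end' events for complete ARTICLE elements are delivered, so each
# element can be handled and then discarded to keep memory flat.
for event, elem in etree.iterparse(in_file, events=('end',), tag='ARTICLE'):
    handle_article(elem)  # placeholder for the actual writing step
    elem.clear()
    # Drop now-empty siblings that the root still references
    while elem.getprevious() is not None:
        del elem.getparent()[0]
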
Josh Voigts
  • Thanks, but isn't that an example for READING? I already use that event-based (start/end) iterparse approach to read incrementally. I don't think it works for writing... – xampione Mar 14 '18 at 15:21
  • You would read and write at the same time; the calls to `start_tag()` and `end_tag()` would be doing the writing. If you give an XML sample, I might be able to be a little more specific. – Josh Voigts Mar 14 '18 at 15:22
  • I see, I need to try it. Thanks for the feedback. Soon I will post the best testing results here for the archive :) – xampione Mar 14 '18 at 15:25
  • OK, I'm struggling here. A simple XML structure could be this one: ` <![CDATA[_using_CDATA_example_]]> `. I'm trying to put together two examples to compare: one with ProcessPoolExecutor from concurrent.futures (which seems to be a Python 3 library) and the example you gave me, since the start/end approach is useful for parsing, so it might be for writing too (with the clear()). – xampione Mar 14 '18 at 17:12
  • Is there any structure that repeats itself? I mean, one that occurs multiple times. – Josh Voigts Mar 14 '18 at 17:18
  • yeah, like: `<![CDATA[_using_CDATA_example_]]> <![CDATA[_using_CDATA_example_]]>`. Some items can repeat, some can be empty, and some can have attributes (like the mode/version). – xampione Mar 14 '18 at 17:22
  • I tried to use multiprocessing Process, but I just created over 700 processes with my i7 cores at 100% :D Not having much success here. I'm trying several approaches but without white smoke. I used this example: [https://stackoverflow.com/questions/38456458/concurrent-futures-not-parallelizing-write] – xampione Mar 15 '18 at 16:11
  • multiprocessing with Queue and Process didn't work either, 3h30 as well. The difference between this new attempt and my old one is that previously I was adding new lines in real time inside the foreach, and now the XML file is being built up as it goes (probably the Queue at work), but I can't improve either the waiting time or the memory used. Don't know what more I can do :\ – xampione Mar 19 '18 at 09:30
  • Can you provide a more complete example of your XML? Sorry to keep asking, it's still hard to understand the structure you're trying to parse. – Josh Voigts Mar 19 '18 at 13:53
  • I improved from 3h30 to < 1h with multiprocessing.dummy.Pool() and cpu_count-1 workers. But @Josh, I'm trying to CREATE the XML, not parse it. I can add two links here, one with the code I produced (I bet it can be improved; I'm a PHP dev trying to do some things in Python/Django :D) and another one with the XML structure. Can I use external links to help, such as pastebin or similar? – xampione Mar 20 '18 at 10:51
  • Here is how I did it: multiprocessing.dummy.Pool + bulk_create (at the moment for 200k results). I used only 1 core, because using multiprocessing with bulk_create was giving me duplicate pk ids in the database (don't know how to fix this). CODE: [https://pastebin.com/jHSXXbfV] and the structure is something like this: XML: [https://pastebin.com/cSLc1ttJ]. I have 3 months of Python, so probably I have some noob code mistakes :D This took < 3200s, against the previous 15000+ using lxml.etree + bulk_create. If you think I can improve my code, I would appreciate it :) – xampione Mar 22 '18 at 16:39
  • Are you able to [profile](https://stackoverflow.com/questions/582336/how-can-you-profile-a-script) it? To find the slow points? – Josh Voigts Mar 22 '18 at 18:16
  • OK, got it. First I did a LIMIT 500 on my 200k SQL query and got this result: `All work finished in 8.51s | 215109 function calls (211346 primitive calls) in 8.626 seconds` eheh... this seems like A LOT :D – xampione Mar 26 '18 at 11:23
  • Now, for the full SQL query: `All work finished in 3274.67s 4143799 function calls (4068833 primitive calls) in 3300.406 seconds` – xampione Mar 26 '18 at 12:44