
My Python lxml tree expands to 5 GB when I serialize it with `tostring()`. Linux kills the process because it runs out of memory.

Technically there is no need to create the complete XML in memory, since it's written to a zip archive right away.

Is there a way to serialize the tree as a stream to a zip archive?

Here is my current code (snippet):

import zipfile
from lxml import etree as ET

# Create a zipfile archive
zip_out = zipfile.ZipFile('outputfile.zip', 'w', compression=zipfile.ZIP_DEFLATED)
# serialize lxml etree to string and write to archive
zip_out.writestr('treefile.xml', ET.tostring(large_etree))

One way would be to write the etree to a temporary file and then write that file to the archive. Not a great workaround, and probably also slow.
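One approach that may avoid both the temp file and the in-memory string (a sketch, not a tested answer): since Python 3.6, `zipfile.ZipFile.open(name, mode='w')` returns a writable file-like stream inside the archive, and lxml's `ElementTree.write()` can serialize directly into a file object. The `root`/`tree` names below are stand-ins for the real (much larger) tree:

```python
import zipfile
from lxml import etree as ET

# Small demo tree standing in for the real multi-gigabyte one.
root = ET.Element('root')
for i in range(3):
    item = ET.SubElement(root, 'item')
    item.text = str(i)
tree = ET.ElementTree(root)

# ZipFile.open(..., mode='w') (Python 3.6+) gives a writable stream
# inside the archive; tree.write() serializes straight into it, so
# the full XML string is never materialized in memory at once.
with zipfile.ZipFile('outputfile.zip', 'w',
                     compression=zipfile.ZIP_DEFLATED) as zf:
    with zf.open('treefile.xml', mode='w') as fh:
        tree.write(fh, xml_declaration=True, encoding='UTF-8')
```

Whether this keeps peak memory low depends on how lxml buffers internally, but it at least removes the explicit `tostring()` copy.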

Joachim
  • Doesn't sound like too bad a workaround to me, especially if you are running on an SSD. A workaround is better than no workaround for sure. It may also be all you have. I've run into the problem before that the zipfile module won't allow you to pass it a file-like object representing an already open stream of data. - I recall googling for another available zip library that might allow that, but I don't think I found anything. – CryptoFool Feb 26 '19 at 01:20
  • Should be possible to use `.writestr(..., .tostring()`. Relevant [get-all-text-inside-a-tag-in-lxml](https://stackoverflow.com/questions/4624062/get-all-text-inside-a-tag-in-lxml/28173933#28173933). Extra handling for the header `'...` is required. – stovfl Feb 26 '19 at 07:44

0 Answers