2

I am parsing some XML and writing data to different files depending on the XML element that is currently being processed. Processing an element is really fast, and so is writing the data, which means the files would need to be opened and closed very often. For example, given a huge file:

import lxml.etree

for _, node in lxml.etree.iterparse(file):
    with open(f"{node.tag}.txt", 'a') as fout:
        fout.write(node.attrib['someattr'] + '\n')

This would work, but relatively speaking it would take a lot of time opening and closing the files. (Note: this is a toy program. In reality the actual contents that I write to the files as well as the filenames are different. See the last paragraph for data details.)

An alternative could be:

fhs = {}
for _, node in lxml.etree.iterparse(file):
    if node.tag not in fhs:
        fhs[node.tag] = open(f"{node.tag}.txt", 'w')

    fhs[node.tag].write(node.attrib['someattr'] + '\n')

for _, fh in fhs.items(): fh.close()

This will keep the files open until the XML parsing is complete. There is a bit of dictionary lookup overhead, but that should be minimal compared to repeatedly opening and closing the files.

My question is: what is the downside of this approach, performance-wise? I know that this will make the open files inaccessible to other processes, and that you may run into a limit on the number of open files. However, I am more interested in performance issues. Does keeping all file handles open create memory or processing issues? Perhaps too much file buffering is going on in such a scenario? I am not certain, hence this question.

The input XML files can be up to around 70GB. The number of files generated is limited to around 35, which is far from the open-file limits I have read about.

Bram Vanroy
can you define *huge* and give an estimate of the # of output files? – hootnot Jun 25 '18 at 14:54
  • Broadly speaking, avoiding having to make OS system calls will speed up your code whatever else it does—in other words, it would be faster to _not_ open and close the files repeatedly. Whether that matters will depend on what exactly you're doing. Each open file will require some memory, so if there are huge numbers of them open at once, it could become an issue (but that's doubtful, IMO). – martineau Jun 25 '18 at 15:16
  • @hootnot I added a paragraph giving some more data information. – Bram Vanroy Jun 25 '18 at 16:22
  • @martineau Please see my edit. – Bram Vanroy Jun 25 '18 at 16:22
Bram: In that case it doesn't sound to me like you would have any worries about using too much memory by having/leaving them all open simultaneously. – martineau Jun 25 '18 at 16:28
  • For 35 files, it's a non-issue. (Cf. my updated answer.) – alexis Jun 25 '18 at 19:35

4 Answers

3

The obvious downside, which you have already mentioned, is that keeping all the file handles open requires memory, how much depending of course on how many files there are. This is a calculation you have to do on your own. And don't forget the write locks.

Otherwise there isn't very much wrong with it per se, but it would be good to take some precautions:

fhs = {}
try:
    for _, node in lxml.etree.iterparse(file):
        if node.tag not in fhs:
            fhs[node.tag] = open(f"{node.tag}.txt", 'w')

        fhs[node.tag].write(node.attrib['someattr'] + '\n')
finally:
    for fh in fhs.values(): fh.close()

Note: when looping over a dict directly in Python (for x in d:), you only get the keys. I'd recommend doing for key, value in d.items(): or for value in d.values(): instead.
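
If you'd rather not manage the try/finally by hand, contextlib.ExitStack can take care of closing every handle for you. A rough, untested sketch that mirrors the question's toy code (file and someattr are placeholders from the question):

import contextlib
import lxml.etree

fhs = {}
with contextlib.ExitStack() as stack:
    for _, node in lxml.etree.iterparse(file):
        if node.tag not in fhs:
            # ExitStack closes every handle registered via enter_context()
            # when the with-block exits, even if parsing raises halfway through.
            fhs[node.tag] = stack.enter_context(open(f"{node.tag}.txt", 'w'))

        fhs[node.tag].write(node.attrib['someattr'] + '\n')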

Pax Vobiscum
    Are the open files kept in memory? I would assume the file _handles_ are kept in memory, not the complete files themselves. – Bram Vanroy Jun 25 '18 at 14:56
  • No, the content is not kept in memory, unless you do a read each time you open them. – Pax Vobiscum Jun 25 '18 at 14:57
    @martineau Correct, meant to say amount of files, not the content. – Pax Vobiscum Jun 25 '18 at 14:59
All that's needed to close all the opened files are the handle values in the dictionary, so the `finally:` clause could simply be: `for fh in fhs.values():`, `fh.close()`. The `fh` variable in your code is _not_ a file handle, so using that as its name is somewhat misleading, IMO. – martineau Jun 25 '18 at 15:07
  • @martineau, but then again, it's not my code, it's the original posted by OP, which I made the most _minimal_ modification to in order for it to work. _Then_ we could start to argue over what is more intuitive :) – Pax Vobiscum Jun 25 '18 at 15:14
The `for fh in fhs: fh.close()` in the OP's code is also wrong, but not in exactly the same way. I think my suggestion to simplify and correctly use the variable name would be the better approach. – martineau Jun 25 '18 at 15:19
  • @martineau But my code wasn't wrong, only unintuitive. In fact, maybe even faster because `.values()` and `.items()` both produce new lists to iterate over. I suppose the only "right" way to do it is with `.itervalues()`. – Pax Vobiscum Jun 25 '18 at 15:26
In Python 3, `values()` returns a view (similar to what `itervalues()`, since removed, did in Python 2). Since we're talking about making a bunch of OS system calls here, it probably isn't worth trying to optimize things at this level of detail. – martineau Jun 25 '18 at 15:31
2

You didn't say just how many files the process would end up holding open. If it's not so many that it creates a problem, then this could be a good approach. I doubt you can really know without trying it out with your data and in your execution environment.

In my experience, open() is relatively slow, so avoiding unnecessary calls is definitely worth thinking about: you also avoid setting up all the associated buffers, populating them, flushing them every time you close the file, and garbage-collecting them. Since you ask, file objects do come with sizeable buffers. On OS X, the default buffer size is 8192 bytes (8 KB), and there is additional overhead for the object itself, as with all Python objects. So if you have hundreds or thousands of files and little RAM, it can add up. You can specify less buffering, or no buffering at all, but that could defeat any efficiency gained from avoiding repeated opens.

Edit: For just 35 distinct files (or any two-digit number), you have nothing to worry about: the space that 35 output buffers will need (at 8 KB per buffer for the actual buffering) will not even be the biggest part of your memory footprint. So just go ahead and do it the way you proposed. You'll see a dramatic speed improvement over opening and closing the file for each XML node.

PS. The default buffer size is given by io.DEFAULT_BUFFER_SIZE.
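
If you want to check or tune this yourself, here is a small sketch (the file names are just illustrative):

import io

print(io.DEFAULT_BUFFER_SIZE)   # commonly 8192 bytes

# Keep the default buffering (probably the right choice here):
fh = open("node.txt", 'w')

# Or request line buffering (text mode flushes on every '\n'),
# trading throughput for data hitting the disk sooner:
fh_line = open("node_line_buffered.txt", 'w', buffering=1)

fh.close()
fh_line.close()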

alexis
The thing I fear with my approach is that, because the handle remains open, nothing is flushed at short intervals (because the file is not closed), and if that is the case for multiple files, all these buffers combined might take up quite some memory. Or is Python/the OS smart enough not to let the buffers fill up memory? (I added some data information to my OP.) – Bram Vanroy Jun 25 '18 at 16:26
  • Output buffers have a fixed, limited length-- that's what a buffer is. They might take up less space until they fill up the first time, but after that the memory footprint will stay constant: The buffer fills up, it gets flushed. However, under some configurations your process might terminate (after everything executes normally) before all open filehandles have been flushed for a final time. So you _should_ close all open filehandles before you end execution, to force flushing. – alexis Jun 25 '18 at 19:25
-1

As a good rule, try to close a file as soon as possible.

Note also that your operating system has limits: you can only have a certain number of files open at once. So you might hit this limit and start getting errors such as "Too many open files" when opening new files.

Leaked memory and file handles are an obvious problem (if you fail to close the files for some reason).
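
On Unix-like systems you can at least check (and, within the hard ceiling, raise) that limit from Python; a quick sketch using the standard resource module (not available on Windows):

import resource

# Soft limit (currently enforced) and hard ceiling for open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)

# A process may raise its own soft limit up to the hard ceiling.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))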

Enthusiast Martin
  • When you close a file as soon as possible, though, and you have to open it again (and repeat this process > 1mil times) then the time it takes to open the file takes up a lot of your processing time. – Bram Vanroy Jun 25 '18 at 16:27
-2

If you are generating thousands of files, you might consider writing them into a directory structure so that they are stored in separate directories for easier access afterwards. For example: a/a/aanode.txt, a/c/acnode.txt, etc.
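
A rough sketch of such a layout, using the first two characters of the tag as directory levels (the tag names are made up and assumed to be at least two characters long):

import os

def path_for(tag):
    # e.g. 'aanode' -> 'a/a/aanode.txt'
    subdir = os.path.join(tag[0], tag[1])
    os.makedirs(subdir, exist_ok=True)
    return os.path.join(subdir, f"{tag}.txt")

with open(path_for("aanode"), 'a') as fout:
    fout.write("some value\n")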

If the XML contains runs of consecutive nodes with the same tag, you can keep writing to the same open file while that is the case and only close it the moment a node destined for another file appears. What you gain from this largely depends on the structure of your XML file.

hootnot