I am parsing some XML and write data to different files depending on the XML element that is currently being processed. Processing an element is really fast, and writing the data is, too. Therefore, files would need to open and close very often. For example, given a huge file
:
for _, node in lxml.etree.iterparse(file):
with open(f"{node.tag}.txt", 'a') as fout:
fout.write(node.attrib['someattr']+'\n'])
This would work, but relatively speaking it would take a lot of time opening and closing the files. (Note: this is a toy program. In reality the actual contents that I write to the files as well as the filenames are different. See the last paragraph for data details.)
An alternative could be:
fhs = {}
for _, node in lxml.etree.iterparse(file):
if node.tag not in fhs:
fhs[node.tag] = open(f"{node.tag}.txt", 'w')
fhs[node.tag].write(node.attrib['someattr']+'\n'])
for _, fh in fhs.items(): fh.close()
This will keep the files open until the parsing of XML is completed. There is a bit of lookup overhead, but that should be minimal compared to iteratively opening and closing the file.
My question is, what is the downside of this approach, performance wise? I know that this will make the open files inaccessible by other processes, and that you may run into a limit of open files. However, I am more interested in performance issues. Does keeping all file handles open create some sort of memory issues or processing issues? Perhaps too much file buffering is going on in such scenarios? I am not certain, hence this question.
The input XML files can be up to around 70GB. The number of files generated is limited to around 35, which is far from the limits I read about in the aforementioned post.