The problem is that on this line:
outF = open("C:\\Users\\####\\Desktop\\bnc.txt", "w")
The same file is opened and closed over and over again.
Behind the scenes:
When you call open, the Python interpreter makes a system call to the operating system, asking the OS to look for the file with that name and return an integer (called a "file descriptor" or "FD") that refers to the file. If the system call succeeds, then the interpreter receives a FD, stores the FD in a new Python object, and returns that object from the open function.
When you call write, the interpreter takes your string and stores it in an internal buffer. When the buffer fills up, or when the outF object is destroyed (as we will see), the interpreter makes a system call asking the OS to write the contents of the buffer to the file that the FD refers to.
When there are no more references to a Python object, the interpreter is free to garbage collect it. But first, the interpreter needs to internally call the object's __del__ method, a.k.a. the object's destructor. A file object's destructor makes a final system call to tell the OS "I don't need this FD anymore, and you can close the file."
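The first two steps above can be observed with a small sketch. It assumes a regular file on disk and the default buffer size; the helper name buffered_write_demo and the temporary file are made up for illustration and are not part of the original script.

```python
import os
import tempfile


def buffered_write_demo():
    """Return (fd_is_int, size_before_close, size_after_close) for a small write."""
    # Create a scratch file so the demo does not touch any real data.
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        f = open(path, "w")
        try:
            # The file object wraps an integer file descriptor from the OS.
            fd_is_int = isinstance(f.fileno(), int)
            # A short string stays in the in-memory buffer for now.
            f.write("hello")
            size_before_close = os.path.getsize(path)
        finally:
            # Closing flushes the buffer to the OS and releases the descriptor.
            f.close()
        size_after_close = os.path.getsize(path)
        return fd_is_int, size_before_close, size_after_close
    finally:
        os.remove(path)


print(buffered_write_demo())
```

The size measured before the close reflects only what has actually reached the OS, which is why a short write does not show up on disk until the buffer is flushed.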
This next part is subtle. open creates and returns a new object (we'll call it f1); outF = open(...) assigns the identifier outF to f1. f1's reference count (the number of references to it) is now 1. On the next iteration of outF = open(...), the call to open runs first and returns a new object (call it f2) that refers to the same file. The assignment then tells the interpreter that you no longer want outF to refer to f1: outF is assigned to f2, f2's reference count is now 1, and f1's reference count drops to 0, allowing the garbage collector to destroy the object and close the file.
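The effect of rebinding a name can be sketched without touching any files. Tracked below is a hypothetical stand-in for a file object that just records when it is created and when its __del__ method runs. The order shown relies on CPython's reference counting; other implementations, such as PyPy, may run __del__ later.

```python
events = []


class Tracked:
    """Hypothetical stand-in for a file object; logs creation and destruction."""

    def __init__(self, name):
        self.name = name
        events.append("open " + name)

    def __del__(self):
        # In CPython this runs as soon as the last reference goes away.
        events.append("close " + self.name)


def rebind_twice():
    events.clear()
    out = Tracked("f1")
    # The right-hand side is evaluated first, then `out` is rebound,
    # which drops the only reference to f1.
    out = Tracked("f2")
    del out
    return list(events)


print(rebind_twice())
```

On CPython, the log shows f2 being created before f1 is destroyed, because the new object has to exist before the name can be pointed at it.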
There is no need to open and close the file over and over again. I recommend opening it before the loop:
#!/usr/bin/env python3
import glob
import xml.etree.ElementTree as ET

filenames = glob.glob("C:\\Users\\####\\Desktop\\BNC2\\[A00-A0B]*.xml")

# Open the output file once, before the loop, and keep it open for every input file.
with open("C:\\Users\\####\\Desktop\\bnc.txt", "w") as outF:
    for filename in filenames:
        with open(filename, 'r', encoding="utf-8") as content:
            tree = ET.parse(content)
            root = tree.getroot()
            # Write one comma-separated line per <w> element.
            for w in root.iter('w'):
                lemma = w.get('hw')
                pos = w.get('pos')
                tag = w.get('c5')
                outF.write(w.text + "," + lemma + "," + pos + "," + tag)
                outF.write("\n")
This has two advantages over building a list within the loop and then opening the file after the loop: it iterates over the data once instead of twice, and its output needs only a constant amount of memory (the fixed-size output buffer) instead of a list that grows with the input.