I'm successfully loading and outputting the data the way I want, except that each pass through the loop overwrites the previous output instead of appending to it, so I am left with only the data from the last file in the loop.

#!/usr/bin/env python3
import glob
import xml.etree.ElementTree as ET
filenames = glob.glob("C:\\Users\\####\\Desktop\\BNC2\\[A00-A0B]*.xml")
for filename in filenames:
    with open(filename, 'r', encoding="utf-8") as content:
        tree = ET.parse(content)
        root = tree.getroot()
        outF = open("C:\\Users\\####\\Desktop\\bnc.txt", "w")
        for w in root.iter('w'):
            lemma = w.get('hw')
            pos = w.get('pos')
            tag = w.get('c5')

            outF.write(w.text + "," + lemma + "," + pos + "," + tag)
            outF.write("\n")

Example:

File 1 - a,b,c,d

File 2 - e,f,g,h

Desired Output:

a,b,c,d

e,f,g,h

Current Output:

e,f,g,h

pglove
  • You should open the `outF` file with `a` not `w`. `w` will truncate the file. The other option is to open with `r+` (which requires the file to exist already) and then seek to the end before writing new content, but that's what `a` is for. – sberry Apr 22 '18 at 03:33
  • You should also consider building up all of the data in some data structure (a list perhaps) and then opening the file for writing and write all of it in one pass rather than continually opening and closing the file. – sberry Apr 22 '18 at 03:34
  • @sberry, Had trouble doing it as a list yesterday as it was a string, but you've solved it with 'a' – pglove Apr 22 '18 at 03:39

2 Answers

The problem is that you are opening the file outF with the w flag but should use the a flag instead.

Changing

outF = open("C:\\Users\\####\\Desktop\\bnc.txt", "w")

to

outF = open("C:\\Users\\####\\Desktop\\bnc.txt", "a")

should solve the problem. Note that w+ would not help: it truncates the file just as w does and only adds the ability to read from it. But here's another idea altogether (which will work with w):

#!/usr/bin/env python3
import glob
import xml.etree.ElementTree as ET
filenames = glob.glob("C:\\Users\\####\\Desktop\\BNC2\\[A00-A0B]*.xml")
out_lines = []
for filename in filenames:
    with open(filename, 'r', encoding="utf-8") as content:
        tree = ET.parse(content)
        root = tree.getroot()
        for w in root.iter('w'):
            lemma = w.get('hw')
            pos = w.get('pos')
            tag = w.get('c5')

            out_lines.append(w.text + "," + lemma + "," + pos + "," + tag)

with open("C:\\Users\\####\\Desktop\\bnc.txt", "w") as out_file:
    for line in out_lines:
        out_file.write("{}\n".format(line))  
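
A quick way to check how each mode treats existing contents is to write to a scratch file and read it back. This is only a sketch: the temporary path and the `write_in_mode` helper are made up for the demonstration and are not part of the original script.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")


def write_in_mode(mode, text):
    """Open `path` in `mode`, write `text`, and return the file's full contents."""
    with open(path, mode) as f:
        f.write(text)
    with open(path, "r") as f:
        return f.read()


after_w = write_in_mode("w", "first\n")        # creates the file
after_a = write_in_mode("a", "second\n")       # keeps "first", adds "second"
after_w_plus = write_in_mode("w+", "third\n")  # truncates first, just like "w"

print(repr(after_w), repr(after_a), repr(after_w_plus))
```

Only the `"a"` call preserves what was already in the file; both `"w"` and `"w+"` start from an empty file.
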
sberry
  • You're a legend! Out of curiosity, what is the benefit of the alternative code you have given? Again, thanks for the solution :D – pglove Apr 22 '18 at 03:47
  • Note that opening the file in append mode (`"a"`) does not overwrite any contents that happened to be there before you ran your program, if the file already existed. If you only want the file to contain the output of your program, then open it in write mode (`"w"`). See [this answer](https://stackoverflow.com/a/1466036/972499). – Jordan Apr 22 '18 at 04:42

The problem is that on this line:

outF = open("C:\\Users\\####\\Desktop\\bnc.txt", "w")

The same file is opened and closed over and over again, and every open in "w" mode truncates it, so only the rows from the last input file survive.

Behind the scenes:

When you call open, the Python interpreter makes a system call to the operating system, asking the OS to look for the file with that name and return an integer (called a "file descriptor" or "FD") that refers to the file. If the system call succeeds, the interpreter receives an FD, stores it in a new Python object, and returns that object from the open function.

When you call write, the interpreter takes your string and stores it in an internal buffer. When the buffer fills up, or when the outF object is destroyed (as we will see), the interpreter makes a system call asking the OS to write the contents of the buffer to the file that the FD refers to.
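
The buffering described above can be observed by checking the file's size on disk before and after an explicit flush. This is a minimal sketch that assumes CPython's default buffering for regular files; the scratch path is made up for the demonstration.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "buffer_demo.txt")

with open(path, "w") as out_f:
    out_f.write("a,b,c,d\n")
    # The short string is normally still sitting in Python's internal buffer,
    # so the OS has not been asked to write anything yet.
    size_before_flush = os.path.getsize(path)

    out_f.flush()  # force the buffered text out to the OS
    size_after_flush = os.path.getsize(path)

print(size_before_flush, size_after_flush)
```

The second size is larger than the first because the data only reaches the file once the buffer is flushed, which also happens automatically when the file is closed.
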

When there are no more references to a Python object, the interpreter is free to garbage collect it. But first, the interpreter needs to internally call the object's __del__ method, a.k.a. the object's destructor. A file object's destructor makes a final system call to tell the OS "I don't need this FD anymore, and you can close the file."

This next part is subtle. open creates and returns a new object (we'll call it f1); outF = open(...) binds the identifier outF to f1. f1's reference count (the number of references to it) is now 1. On the next iteration, the right-hand side of outF = open(...) is evaluated first: open returns a new object (call it f2) that refers to the same file and, because of the "w" mode, truncates it. Binding outF to f2 then means that outF no longer refers to f1. f1's reference count drops to 0, so the interpreter (CPython, at least) destroys the object and closes its file descriptor. f2's reference count is now 1.
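
The rebinding step can be watched with a stand-in class that logs when it is created and destroyed. This is a sketch only: `Tracked` and `events` are hypothetical names, no real file is involved, and the immediate call to `__del__` relies on CPython's reference counting (other interpreters may run it later).

```python
events = []


class Tracked:
    """Stand-in for a file object that records its creation and destruction."""

    def __init__(self, name):
        self.name = name
        events.append("open " + name)

    def __del__(self):
        # Plays the role of the file object's destructor closing the FD.
        events.append("close " + self.name)


handle = Tracked("f1")
# The right-hand side runs first, so f2 exists before f1 loses its last reference.
handle = Tracked("f2")

print(events)
```

Note the order in the log: the second object is created before the first one is released, which is why a second open in "w" mode truncates the file while the first file object is still alive.
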

There is no need to open and close the file over and over again. I recommend opening it before the loop:

#!/usr/bin/env python3
import glob
import xml.etree.ElementTree as ET
filenames = glob.glob("C:\\Users\\####\\Desktop\\BNC2\\[A00-A0B]*.xml")
with open("C:\\Users\\####\\Desktop\\bnc.txt", "w") as outF:
    for filename in filenames:
        with open(filename, 'r', encoding="utf-8") as content:
            tree = ET.parse(content)
            root = tree.getroot()
            for w in root.iter('w'):
                lemma = w.get('hw')
                pos = w.get('pos')
                tag = w.get('c5')

                outF.write(w.text + "," + lemma + "," + pos + "," + tag)
                outF.write("\n")

This has two advantages over building a list within the loop and then opening the file after the loop. It iterates over the rows once instead of twice, and the pending output takes a constant amount of memory (the fixed-size output buffer) instead of a list that grows with every row.
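
For a version that can be run without the BNC files, here is a self-contained sketch of the same open-once pattern. The two tiny XML documents are invented sample data that mirror the a,b,c,d example from the question, the temporary directory is an assumption, and the `or ""` guard for missing text or attributes is an addition that is not in the original code.

```python
import glob
import os
import tempfile
import xml.etree.ElementTree as ET

# Invented sample data standing in for the real corpus files.
work_dir = tempfile.mkdtemp()
samples = {
    "A00.xml": '<s><w hw="b" pos="c" c5="d">a</w></s>',
    "A01.xml": '<s><w hw="f" pos="g" c5="h">e</w></s>',
}
for name, xml_text in samples.items():
    with open(os.path.join(work_dir, name), "w", encoding="utf-8") as f:
        f.write(xml_text)

out_path = os.path.join(work_dir, "bnc.txt")

# Open the output once, before the loop, so "w" truncates only a single time.
with open(out_path, "w", encoding="utf-8") as out_f:
    for filename in sorted(glob.glob(os.path.join(work_dir, "A0*.xml"))):
        root = ET.parse(filename).getroot()
        for w in root.iter("w"):
            fields = [w.text, w.get("hw"), w.get("pos"), w.get("c5")]
            # Replace missing text or attributes with "" so join() never sees None.
            out_f.write(",".join(field or "" for field in fields) + "\n")

with open(out_path, encoding="utf-8") as f:
    result = f.read()

print(result)
```

Each row is written as soon as it is produced, so nothing but the current row and the output buffer has to be held between iterations.
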

Jordan