54

As a complete beginner to programming, I am trying to understand the basic concepts of opening and closing files. One exercise I am doing is creating a script that allows me to copy the contents from one file to another.

in_file = open(from_file)
indata = in_file.read()

out_file = open(to_file, 'w')
out_file.write(indata)

out_file.close()
in_file.close()

I have tried to shorten this code and came up with this:

indata = open(from_file).read()
open(to_file, 'w').write(indata)

This works and looks a bit more efficient to me. However, this is also where I get confused. I think I left out the references to the opened files; there was no need for the in_file and out_file variables. However, does this leave me with two files that are open, but have nothing referring to them? How do I close these, or is there no need to?

Any help that sheds some light on this topic is much appreciated.

svick
  • 236,525
  • 50
  • 385
  • 514
Roy Crielaard
  • 645
  • 6
  • 11
  • 3
    Side-note: What you're doing is subject to memory issues for large input files; I'd suggest taking a look at [`shutil.copyfileobj`](https://docs.python.org/3/library/shutil.html#shutil.copyfileobj) and related methods for doing this (which explicitly copy block by block, so peak memory consumption is fixed, not dependent on the size of the input file); there's a short sketch of that approach after these comments. – ShadowRanger Mar 16 '16 at 20:34
  • 3
    There's a good discussion of this at http://stackoverflow.com/questions/7395542/is-explicitly-closing-files-important – snakecharmerb Mar 16 '16 at 20:36
  • 10
    What do you mean by "looks a bit more efficient"? If you mean the source code is shorter, well, that's certainly true. But it's not generally considered a priority to make the code as short as possible in Python. Much more important is readability. If you mean you think it will execute more efficiently, I'd say it probably won't make enough difference to matter, though to be sure you always want to measure. Why it shouldn't make much difference is that creating and assigning variables is cheap. In Python, all assignment amounts to setting a pointer. – John Y Mar 16 '16 at 21:30
  • @JohnY - That is an excellent question. I think I had the assumption that "shorter" or "involving less intermediate steps" or "less variables" is good. You make a good point that it is actually readability that is more important (especially in a script with more code than my simple exercise). When you say assigning variables is cheap, I assume that means it will not cost much memory? – Roy Crielaard Mar 17 '16 at 09:30
  • 2
    @RoyCrielaard indeed - a "variable" is just a reference to data that must _already exist_ in memory, it is the size of a memory pointer. Assignment is cheap both in terms of _size_ and _speed_. Be aware that even in your first example you have file handle leak issues - if the writing crashes, for example, nothing will be closed. – Boris the Spider Mar 17 '16 at 09:54
  • "Fewer intermediate steps" is important when a block of code gets executed millions of times, or when the "steps" are really large (like reading in a whole file). A single assignment is trivial; in fact, you can often make a program *faster* by assigning a non-local function or reference to a local variable. Readability is indeed always more important. – alexis Mar 24 '16 at 18:49

6 Answers

58

The pythonic way to deal with this is to use the with context manager:

with open(from_file) as in_file, open(to_file, 'w') as out_file:
    indata = in_file.read()
    out_file.write(indata)

Used with files like this, with will ensure all the necessary cleanup is done for you, even if read() or write() throw errors.
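
If it helps to see what the with statement is doing for you, the line above behaves roughly like this hand-written version (a sketch only; the real context manager protocol has a few more moving parts):

in_file = open(from_file)
try:
    out_file = open(to_file, 'w')
    try:
        indata = in_file.read()
        out_file.write(indata)
    finally:
        out_file.close()   # runs even if read() or write() raises
finally:
    in_file.close()        # runs even if opening to_file fails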

RoadieRich
  • 6,330
  • 3
  • 35
  • 52
  • 25
    Note: In the CPython reference interpreter, `open`ing and immediately `read`ing/`write`-ing to the returned file object (without storing a reference to the open file object) happens to be predictable and safe, but that is not the case on most other interpreters, and it's not a contractual guarantee of the language (just a side-effect of CPython reference counting). Using `with` is more scalable, portable, and predictable, so always use it, even if it seems like it works fine without it. – ShadowRanger Mar 16 '16 at 20:30
  • Are you saying that `open(foo).read()` isn't reliable on other implementations? That's horrifying; which ones behave so? – Russell Borogove Mar 16 '16 at 20:52
  • 8
    @RussellBorogove: Probably all the big ones; I don't know of any non-CPython implementations that do refcounting. A quick search shows [official PyPy documentation](http://pypy.org/compat.html) warning that `open("filename", "w").write("stuff")` will only write to the file once the GC runs a collection, which is a pretty similar case. (If by "isn't reliable", you mean something like "might crash" or "might give wrong results" rather than "might leak file handles", `open(foo).read()` isn't any more prone to that than `with open(foo) as f: contents = f.read()`.) – user2357112 Mar 16 '16 at 21:19
  • I'm not concerned with filehandle leaks; I'm very concerned about lost writes in particular. – Russell Borogove Mar 16 '16 at 23:29
  • 3
    @RussellBorogove `open(foo).read()` should be 'safe', as during the execution of `read()` the file object will still be referenced. But there's no guarantee about when (or in what order) the closes will happen when you have multiple file operations, just that it will be "some time after the file objects are not referenced". This means writes can be arbitrarily delayed (or possibly even lost if objects still referenced when the program exits don't have their finalizers run, which not even CPython absolutely guarantees: https://docs.python.org/2/reference/datamodel.html#object.__del__) – Ben Mar 17 '16 at 02:11
  • this question has a Python 2.7 tag – Jared Goguen Mar 17 '16 at 02:14
  • @jared updated the link, but nothing else is different between the two versions, at least as far as this answer is concerned. – RoadieRich Mar 17 '16 at 02:33
  • I could have sworn you needed `from __future__ import with_statement` in 2.X. Wow, I'm going crazy... – Jared Goguen Mar 17 '16 at 02:41
  • 1
    @jared You only need the `__future__` import in 2.5. 2.6 had `with` statements with a single context, 2.7 backported multiple contexts, as seen above, from version 3.1. Prior to 2.5, you're out of luck, however. – RoadieRich Mar 17 '16 at 02:43
  • @ShadowRanger - I had to look up what CPython actually means. For anyone else interested, this discussion is interesting: http://stackoverflow.com/questions/17130975/python-vs-cpython – Roy Crielaard Mar 17 '16 at 09:44
40

The default Python interpreter, CPython, uses reference counting. This means that once there are no references left to an object, it gets garbage collected, i.e. cleaned up.

In your case, doing

open(to_file, 'w').write(indata)

will create a file object for to_file, but not assign it to a name - this means there is no reference to it. You cannot possibly manipulate the object after this line.

CPython will detect this, and clean up the object after it has been used. In the case of a file, this means closing it automatically. In principle, this is fine, and your program won't leak memory.

The "problem" is this mechanism is an implementation detail of the CPython interpreter. The language standard explicitly gives no guarantee for it! If you are using an alternate interpreter such as pypy, automatic closing of files may be delayed indefinitely. This includes other implicit actions such as flushing writes on close.

This problem also applies to other resources, e.g. network sockets. It is good practice to always explicitly handle such external resources. Since python 2.6, the with statement makes this elegant:

with open(to_file, 'w') as out_file:
    out_file.write(in_data)

TLDR: It works, but please don't do it.

MisterMiyagi
  • 44,374
  • 10
  • 104
  • 119
38

You asked about the "basic concepts", so let's take it from the top: When you open a file, your program gains access to a system resource, that is, to something outside the program's own memory space. This is basically a bit of magic provided by the operating system (a system call, in Unix terminology). Hidden inside the file object is a reference to a "file descriptor", the actual OS resource associated with the open file. Closing the file tells the system to release this resource.

As an OS resource, the number of files a process can keep open is limited: Long ago the per-process limit was about 20 on Unix. Right now my OS X box imposes a limit of 256 open files (though this is an imposed limit, and can be raised). Other systems might set limits of a few thousand, or in the tens of thousands (per user, not per process in this case). When your program ends, all resources are automatically released. So if your program opens a few files, does something with them and exits, you can be sloppy and you'll never know the difference. But if your program will be opening thousands of files, you'll do well to release open files to avoid exceeding OS limits.
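
If you're curious what the limit is on your own machine, the standard library will tell you (this uses the resource module, which is available on Unix-like systems but not on Windows):

import resource

# Prints the (soft, hard) limit on open file descriptors for this process.
# The soft limit is the one you actually hit; `ulimit -n` reports the same number.
print(resource.getrlimit(resource.RLIMIT_NOFILE))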

There's another benefit to closing files before your process exits: If you opened a file for writing, closing it will first "flush its output buffer". I/O libraries optimize disk use by collecting ("buffering") what you write out and saving it to disk in batches, so if you write text to a file and immediately try to reopen and read it without first closing the output handle, you'll find that not everything has been written out. Also, if your program exits too abruptly (because of a signal, or occasionally even through normal exit), the output might never be flushed.
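
You can watch the buffering happen with something along these lines (exactly when data reaches the disk depends on the buffer size and platform, so treat this as a sketch; buffer_demo.txt is just a throwaway file name):

out = open('buffer_demo.txt', 'w')
out.write('hello')                     # likely still sitting in the write buffer

with open('buffer_demo.txt') as check:
    print(repr(check.read()))          # often prints '' - nothing flushed to disk yet

out.flush()                            # push the buffer out without closing the file
with open('buffer_demo.txt') as check:
    print(repr(check.read()))          # now prints 'hello'

out.close()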

There's already plenty of other answers on how to release files, so here's just a brief list of the approaches:

  1. Explicitly with close(). (Note for python newbies: Don't forget the parens! My students like to write in_file.close, which does nothing; there's a short demonstration of this after the list.)

  2. Recommended: Implicitly, by opening files with the with statement. The close() method will be called when the end of the with block is reached, even in the event of abnormal termination (from an exception).

    with open("data.txt") as in_file:
        data = in_file.read()
    
  3. Implicitly, by reference counting or the garbage collector, if your Python implementation provides it. This is not recommended since it's not entirely portable; see the other answers for details. That's why the with statement was added to Python.

  4. Implicitly, when your program ends. If a file is open for output, this may run a risk of the program exiting before everything has been flushed to disk.
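
Here is the close vs. close() slip from point 1 spelled out, assuming some data.txt file exists:

in_file = open("data.txt")

in_file.close          # merely refers to the method; nothing is called, the file stays open
print(in_file.closed)  # False

in_file.close()        # the parentheses actually call the method
print(in_file.closed)  # True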

Community
  • 1
  • 1
alexis
  • 48,685
  • 16
  • 101
  • 161
  • *"the number of files a process can keep open is limited (long ago the limit was about 20 on Unix"* I can hardly see this as an issue since on any modern Unix-like, the limit is so high the system will run out of memory or CPU time before the limit is reached. – cat Mar 17 '16 at 19:28
  • Thanks... I looked quickly but didn't find any hard numbers, so I went from memory. Added a little research now. (While the _hard_ limits may be astronomical or unlimited, the number is usually limited with `ulimit` and the like.) – alexis Mar 17 '16 at 22:23
  • The default maximum number of file descriptors on Linux for the last decades, and still default on most Linux distributions today, [is `1024`](https://www.baeldung.com/linux/limit-file-descriptors#ulimit-and-soft-limits). You can check it with `ulimit -n`. It can be changed. – nh2 Jun 29 '23 at 02:25
7

It is good practice to use the with keyword when dealing with file objects. This has the advantage that the file is properly closed after its suite finishes, even if an exception is raised on the way. It is also much shorter than writing equivalent try-finally blocks:

>>> with open('workfile', 'r') as f:
...     read_data = f.read()
>>> f.closed
True
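
To see the "even if an exception is raised" part in action, you could try something like the following; the RuntimeError is just a made-up failure:

try:
    with open('workfile', 'r') as f:
        raise RuntimeError('simulated failure while reading')
except RuntimeError:
    pass

print(f.closed)   # True - the with block closed the file despite the exception
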
python
  • 4,403
  • 13
  • 56
  • 103
7

The answers so far are absolutely correct when working in python. You should use the with open() context manager. It's a great built-in feature, and helps shortcut a common programming task (opening and closing a file).

However, since you are a beginner and won't always have context managers and automatic reference counting available throughout your career, I'll address the question from a general programming standpoint.

The first version of your code is perfectly fine. You open a file, save the reference, read from the file, then close it. This is how a lot of code is written when the language doesn't provide a shortcut for the task. The only thing I would improve is to move each close() call next to the code that uses the file. Once you have opened and read the input file, its contents are in memory and you no longer need the file to be open.

in_file = open(from_file)
indata = in_file.read()
in_file.close()

out_file = open(to_file, 'w')
out_file.write(indata)
out_file.close()
Community
  • 1
  • 1
Darrick Herwehe
  • 3,553
  • 1
  • 21
  • 30
5

A safe way to open files without having to worry that you didn't close them is like this:

with open(from_file, 'r') as in_file:
    in_data = in_file.read()

with open(to_file, 'w') as out_file:
    out_file.write(in_data)
ShadowRanger
  • 143,180
  • 12
  • 188
  • 271
Mark Skelton
  • 3,663
  • 4
  • 27
  • 47