Copying a file in Python with a straightforward approach typically looks like this:

def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    while 1:
        buf = fsrc.read(length)
        if not buf:
            break
        fdst.write(buf)

(This code snippet is from shutil.py, by the way).

Unfortunately, this has drawbacks in my particular use case, which involves _threading and very large buffers_. First, it means that each call to read() allocates a new memory chunk, and when buf is overwritten in the next iteration this memory is freed, only for new memory to be allocated again for the same purpose. This can slow down the whole process and put unnecessary load on the host.

To avoid this I'm using the file.readinto() method, which, unfortunately, is documented as deprecated and marked "don't use":

import array

def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    buffer = array.array('c')          # one buffer, allocated once
    buffer.fromstring('-' * length)
    while True:
        count = fsrc.readinto(buffer)  # fill the existing buffer in place
        if count == 0:
            break
        if count != len(buffer):       # last, partial block
            fdst.write(buffer.tostring()[:count])
        else:
            buffer.tofile(fdst)

My solution works, but it has two drawbacks as well: First, readinto() is deprecated and might go away (says the documentation). Second, with readinto() I cannot decide how many bytes I want to read into the buffer, and with buffer.tofile() I cannot decide how many I want to write, hence the cumbersome special case for the last block (which is also unnecessarily expensive).

I've looked at array.array.fromfile(), but it cannot be used to read "all there is" (it reads, then throws EOFError and doesn't hand out the number of processed items). Also, it is no solution for the last-block special-case problem.

Is there a proper way to do what I want to do? Maybe I'm just overlooking a simple buffer class or similar which does what I want.
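
To illustrate the interface I'm after, here is an untested sketch (it needs Python 2.7+ for memoryview, and files opened in binary mode) that would reuse one buffer and still handle the last partial block without an extra copy:

def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    buf = bytearray(length)   # allocated once, reused for every block
    view = memoryview(buf)    # lets us write a slice without copying
    while True:
        count = fsrc.readinto(buf)
        if not count:
            break
        fdst.write(view[:count])  # write only the bytes actually read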

Alfe
  • Is there any reason why you cannot use shutil.copyfile(src, dst)? – Maria Zverina Mar 20 '12 at 17:24
  • Python is a high-level language. It _already has_ all this stuff built in. Don't rewrite it. – Katriel Mar 20 '12 at 17:26
  • For specific reasons I'm using very large buffers in a multi-threaded environment. There's always a good chance that between freeing and allocating the memory anew another thread is messing up the memory (getting a small piece of the large chunk). In this case it really slows down everything, leaves holes in the memory and eventually throws MemoryErrors in rare cases. I'm trying to avoid this by not allocating and freeing the memory. That's why I'm looking for a replacement of the old readinto(). – Alfe Mar 21 '12 at 09:52

2 Answers


This code snippet is from shutil.py

Which is a standard library module. Why not just use it?

First, it means that each call to read() allocates a new memory chunk, and when buf is overwritten in the next iteration this memory is freed, only for new memory to be allocated again for the same purpose. This can slow down the whole process and put unnecessary load on the host.

This is tiny compared to the effort required to actually grab a page of data from disk.
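
If in doubt, it is easy to get a rough number for the allocation side on your own machine. A minimal sketch (the 16 KiB matches the question's buffer size; the result is obviously machine-dependent):

import timeit

# Approximate cost of allocating and discarding one fresh 16 KiB block,
# i.e. the per-iteration overhead the question worries about.
n = 1000000
seconds = timeit.timeit("b'x' * (16 * 1024)", number=n)
print("~%.2f microseconds per 16 KiB allocation" % (seconds / n * 1e6))

On typical hardware this comes out far below the time needed to actually fetch the same amount of data from disk.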

Karl Knechtel
  • Also the heap will probably just give you the same memory back, since you've _just_ freed it. – Katriel Mar 20 '12 at 17:26
  • Not to mention how much is lost by incurring the overhead of going through Python in the first place. That said, it's at least a little surprising that the CPython `shutil` doesn't implement these things directly in C, given how trivial the Python implementation apparently is, and given how much else is done that way. – Karl Knechtel Mar 20 '12 at 18:45
  • 1
    I've measured it and found a significant performance increase without the unnecessary memory management. Since I'm developing a tool to copy large amounts of data, I'd like to wait for the I/O instead of the memory management. So the question is not whether to use file.read()/file.write() but whether there is a better replacement for the deprecated readinto(). – Alfe Mar 21 '12 at 08:52
  • I find that hard to believe without measurements. CPython allocates memory in chunks which means that allocating space for objects does not require any system calls. – Björn Lindqvist Mar 21 '12 at 09:07
  • Okay, I now see that my example above was a little misleading. My real scenario is more complex. It reads a large chunk (~10MB) in one thread and writes it in another; other threads allocate memory as well. This can leave holes. Anyway, don't stick too close to the example. I'm looking for a replacement for readinto() for reading into an existing buffer, and maybe also for writing from that buffer. This general use case looks so obvious to me that I'm surprised that (besides a deprecated version) nothing exists for this in the standard Python lib. – Alfe Mar 21 '12 at 09:57

Normal Python code would not be in need of such tweaks as this. However, if you really need all that performance tuning to read files from inside Python code (as in, you are rewriting, for performance or memory usage, some server code you wrote that already works), I'd rather call the OS directly using ctypes, thus having the copy performed at as low a level as you want.
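
For illustration, a minimal POSIX-only sketch of that idea (the helper name copy_fd is made up, and the file descriptors would come from os.open()); it reuses one preallocated C buffer instead of creating a new Python string per block:

import ctypes
import ctypes.util
import os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.read.restype = ctypes.c_ssize_t
libc.write.restype = ctypes.c_ssize_t

def copy_fd(src_fd, dst_fd, length=16*1024):
    # e.g. copy_fd(os.open(src, os.O_RDONLY),
    #              os.open(dst, os.O_WRONLY | os.O_CREAT))
    buf = ctypes.create_string_buffer(length)  # allocated once, reused
    while True:
        count = libc.read(src_fd, buf, length)
        if count < 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
        if count == 0:  # EOF
            break
        written = 0
        while written < count:  # write() may be partial; loop until done
            n = libc.write(dst_fd, ctypes.byref(buf, written), count - written)
            if n < 0:
                err = ctypes.get_errno()
                raise OSError(err, os.strerror(err))
            written += n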

It may even be possible that simply calling the "cp" executable as an external process is less of a hurdle in your case (and it would take full advantage of all OS- and filesystem-level optimizations for you).
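
Something along these lines (POSIX-only; the paths are placeholders):

import subprocess

src = "/path/to/source"  # placeholder paths
dst = "/path/to/dest"

# Delegate the copy to the platform's cp binary, which gets all OS- and
# filesystem-level optimizations for free.
subprocess.check_call(["cp", src, dst])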

jsbueno
  • Nice idea. It will drop my portability (to Windows, for instance). But I might consider this, thanks :-) – Alfe Mar 21 '12 at 08:54