
If I have the following code:

f = open("test.txt", "w")
x = []
for z in range(0, 400):
    x.append(z)

How do I write the entire array out to disk while utilizing buffering?

Does this have automatic buffering built into it?

for z in range(0,len(x)):
    f.write(str(z))
f.close()

It looks as if it will make a call to write for each element, which I fear will turn into a separate write operation to the disk each time, a few characters at a time, rather than buffering the data for faster write speed.

Note: I researched this similar post, but did not find the answer to my question: Writing a list to a file with Python

  • The write command fills the internal io write buffer. Nothing is written until the OS says it should be. However, you can force it with an f.flush() call. – CoconutBandit Jan 27 '17 at 21:07
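A minimal sketch of the behaviour that comment describes, assuming a throwaway file name demo.txt: text-mode file objects are block-buffered by default, and flush() (or close()) is what actually pushes the buffered data down to the OS.

# Sketch only: demo.txt is a placeholder file name.
f = open("demo.txt", "w")   # buffered text stream (block-buffered by default)

for z in range(400):
    f.write(str(z))         # each call lands in Python's internal write buffer

f.flush()                   # force whatever is buffered out to the OS now
f.close()                   # close() flushes too, so the explicit flush() is optional here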

1 Answer


Indeed, as you're intuiting, never make N write calls to disk if you can reduce N to 1. For your particular case, let's measure the speed-up we get from reducing N to 1 by using json.dumps to serialize your test array:

import timeit
import json

x = []
for z in range(0, 800000):
    x.append(z)


def f1():
    with open("f1.txt", "w") as f:
        for z in range(0, len(x)):
            f.write(str(z))


def f2():
    with open("f1.txt", "w") as f:
        f.write(json.dumps(x))

N = 10

print(timeit.timeit('f1()', setup='from __main__ import f1', number=N))
print(timeit.timeit('f2()', setup='from __main__ import f2', number=N))

The output shows roughly a 5.5x speed-up:

8.25567383571545
1.508784956563721

If I increase the array size to 8,000,000, the speed-up is roughly 3.6x:

82.87025446748072
22.56552114259503

And if I set N=1 with the array size at 8,000,000, the speed-up is roughly 3.8x:

8.355991964475688
2.227114091734297

I'm not saying json.dumps is the best way to dump your particular data, but it's a reasonably good one. My point is: always try to keep the number of disk write calls to a minimum when possible, since disk write operations are generally quite expensive.
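Since the original question is about buffering specifically, it may also be worth noting that open() accepts a buffering argument, so you can enlarge the buffer that the small writes land in. A rough sketch (the 1 MiB size is an arbitrary illustrative value):

x = list(range(800000))

# buffering=... sets the size (in bytes) of the file's internal buffer;
# 1 MiB here is arbitrary, the default is usually around 8 KiB.
with open("f1.txt", "w", buffering=1024 * 1024) as f:
    for z in x:
        f.write(str(z))   # small writes accumulate in the 1 MiB buffer
                          # and hit the disk in large chunks

This reduces the number of OS-level writes, though as the comments below point out, the per-call Python overhead of the loop remains, so the single json.dumps write is still faster.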

Additional info: if you want a good comparison of faster serialization methods, I suggest you take a look at this article.
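As one concrete example (this is not from the linked article), the standard-library pickle module also serializes the whole list with a single call, to a compact binary format:

import pickle

x = list(range(800000))

# pickle.dump serializes the whole list in one call; the file has to be
# opened in binary mode ("wb"), and the result is not human-readable.
with open("f1.pkl", "wb") as f:
    pickle.dump(x, f)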

EDIT: Following @ShadowRanger's suggestion, I've added a couple more tests to the experiment comparing write and writelines; here you go:

import timeit
import json

K = 8000000
x = []
for z in range(0, K):
    x.append(z)


def f1():
    with open("f1.txt", "w") as f:
        for z in range(0, len(x)):
            f.write(str(z))


def f2():
    with open("f1.txt", "w") as f:
        f.write(json.dumps(x))


def f3():
    # build one big string in pure Python, then make a single write call
    with open("f1.txt", "w") as f:
        f.write(''.join(map(str, x)))


def f4():
    # let writelines issue one write call per element
    with open("f1.txt", "w") as f:
        f.writelines(map(str, x))

N = 1

print(timeit.timeit('f1()', setup='from __main__ import f1', number=N))
print(timeit.timeit('f2()', setup='from __main__ import f2', number=N))
print(timeit.timeit('f3()', setup='from __main__ import f3', number=N))
print(timeit.timeit('f4()', setup='from __main__ import f4', number=N))

The output would be:

8.369193972488317
2.154246855128056
2.667741175272406
8.156553772208122

Here you can still see that a single write is faster than writelines, and that a simple serialization method like json.dumps is slightly faster than a manual one written in pure Python code, i.e. ''.join(map(str, x)).
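One side note on the trade-off (not something benchmarked above): ''.join(map(str, x)) materializes the whole output string in memory before the single write, whereas writelines can consume a lazy generator and lean on the file's internal buffer, which keeps peak memory low at the cost of the extra per-call overhead. A rough sketch of that variant:

# Hypothetical f5, not part of the timings above: streams the values
# instead of building one huge string, so peak memory stays small while
# the file object's internal buffer still batches the actual disk writes.
def f5():
    with open("f1.txt", "w") as f:
        f.writelines(str(z) for z in x)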

  • You're not really measuring what you think you're measuring. Python does buffer internally, so the actual number of system calls is a small fraction of the number of `write` calls made. The problem is that Python has extreme overhead for executing Python level code, particularly when it comes to overhead from calling C level functions that do very little work from the Python layer, so the overhead of the function call outweighs the work, and this is a pathological case. `json.dumps` is internally doing most of the work in a single C function call, and that saves a huge amount. – ShadowRanger Jan 27 '17 at 21:45
  • @ShadowRanger I see, do you know any way to measure the real number of system calls in the above code? It'd be interesting to compare that instead, to have a real comparison. – BPL Jan 27 '17 at 21:51
  • For a more fair comparison, use similar methods of testing: Compare `f.write(''.join(map(str, range(somenum))))` to `f.writelines(map(str, range(somenum)))`. They do the same work to convert from `int` to `str`, the former just combines all the `str` into a single `str` to do one `write` call, the latter uses `writelines` (which is [basically just `for x in seq: self.write(x)`](https://hg.python.org/cpython/file/3.5/Modules/_io/iobase.c#l705), but in C, avoiding interpreter overhead). The difference between the two is noticeable, but it's only on the order of ~50% longer for `writelines`. – ShadowRanger Jan 27 '17 at 21:53
  • On Linux, `strace -c` works. I tried `strace -c python3 -c 'with open("/path/to/file", "w") as f: f.writelines(map(str, range(400000)))'` and `strace -c python3 -c 'with open("/path/to/file", "w") as f: f.write("".join(map(str, range(400000))))'`. The latter had one syscall for `write` taking 769 us, the former had 40 syscalls for `write` taking 0 us each (it's rounded, possibly truncated, so that number is kind of a lie). Even so, the smaller writes have lower call overhead (presumably because the OS lazily writes them while Python is doing the work to set up the next `write`). – ShadowRanger Jan 27 '17 at 22:00
  • BTW, interesting numbers. I ran similar tests on my Linux box with `ipython` `%timeit` magic, and it was slightly over 50% longer to do `writelines(...)` vs. `write(''.join(...))`, not 4x longer. I'm guessing some difference in OS or filesystem. – ShadowRanger Jan 27 '17 at 22:07
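As a rough way to approximate the measurement asked about in the comments (short of strace), you can wrap the file in your own raw layer and count how many writes actually reach it. This is a sketch only, using a hypothetical CountingRawFile helper; it counts calls at Python's raw-I/O layer, which is a proxy for, not an exact count of, the underlying system calls:

import io

class CountingRawFile(io.FileIO):
    """io.FileIO subclass that counts the write() calls reaching the raw layer."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.raw_writes = 0

    def write(self, b):
        self.raw_writes += 1
        return super().write(b)

raw = CountingRawFile("f1.txt", "w")
f = io.TextIOWrapper(io.BufferedWriter(raw), encoding="utf-8")

for z in range(400000):
    f.write(str(z))      # 400,000 Python-level write calls...

f.close()                # closing flushes and closes the whole chain
print(raw.raw_writes)    # ...but far fewer raw writes actually reach the OS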