4

I've been trying out PyPy lately, and it's as much as 25x faster for my current project, so it's working pretty well. Unfortunately, however, writing files is incredibly slow: roughly 60 times slower than under regular Python.

I've been googling around a bit, but I haven't found anything helpful. Is this a known issue? Is there a workaround?

In a simple test case like this:

with file(path, 'w') as f:
    f.writelines(['testing to write a file\n' for i in range(5000000)])

I'm seeing a 60x slowdown in PyPy compared to regular Python. This is with 64-bit Python 2.7.3 and 32-bit PyPy 1.9 (which implements Python 2.7.2), both on the same OS and machine, of course (Windows 7).

Any help would be appreciated. PyPy is much faster for what I'm doing, but with file write speeds limited to half a megabyte per second, it's decidedly less useful.

Simon Lundberg
  • 1,413
  • 2
  • 11
  • 23
  • On Linux those speeds are very comparable. PyPy for me is marginally slower (20%) for GC reasons (there is a branch to fix those, though). Some sort of Windows strangeness? Can you please put such things on bugs.pypy.org instead of here? Stack Overflow is not a very good replacement for a bug tracker. – fijal Sep 25 '12 at 14:28
  • https://bugs.pypy.org/issue1268?@template=item&@pagesize=50&@startwith=0 – Simon Lundberg Sep 26 '12 at 12:55

4 Answers

2

It's slower, but not 60x slower on this system

TL;DR: Use write('\n'.join(...)) instead of writelines(...)

$ pypy -m timeit -s "path='tst'" "with file(path, 'w') as f:f.writelines(['testing to write a file\n' for i in range(5000000)])"
10 loops, best of 3: 1.15 sec per loop

$ python -m timeit -s "path='tst'" "with file(path, 'w') as f:f.writelines(['testing to write a file\n' for i in range(5000000)])"
10 loops, best of 3: 434 msec per loop

xrange makes no difference

$ pypy -m timeit -s "path='tst'" "with file(path, 'w') as f:f.writelines(['testing to write a file\n' for i in xrange(5000000)])"
10 loops, best of 3: 1.15 sec per loop

Using a generator expression is slower for pypy, but faster for python

$ pypy -m timeit -s "path='tst'" "with file(path, 'w') as f:f.writelines('testing to write a file\n' for i in xrange(5000000))"
10 loops, best of 3: 1.62 sec per loop
$ python -m timeit -s "path='tst'" "with file(path, 'w') as f:f.writelines('testing to write a file\n' for i in xrange(5000000))"
10 loops, best of 3: 407 msec per loop

Moving creation of the data outside the benchmark amplifies the difference (~4.2x)

$ pypy -m timeit -s "path='tst'; data=['testing to write a file\n' for i in range(5000000)]" "with file(path, 'w') as f:f.writelines(data)"
10 loops, best of 3: 786 msec per loop
$ python -m timeit -s "path='tst'; data=['testing to write a file\n' for i in range(5000000)]" "with file(path, 'w') as f:f.writelines(data)"
10 loops, best of 3: 189 msec per loop

Using write() instead of writelines() is much faster for both

$ pypy -m timeit -s "path='tst'; data='\n'.join('testing to write a file\n' for i in range(5000000))" "with file(path, 'w') as f:f.write(data)"
10 loops, best of 3: 51.9 msec per loop
$ python -m timeit -s "path='tst'; data='\n'.join('testing to write a file\n' for i in range(5000000))" "with file(path, 'w') as f:f.write(data)"
10 loops, best of 3: 52.4 msec per loop
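
For reference, a minimal sketch of applying the single-write approach from the TL;DR to the original example. It is a slight variant of the benchmark above: the per-item '\n' is dropped so that join supplies the separators.

with file(path, 'w') as f:
    # build the whole payload in memory first, then issue a single write()
    f.write('\n'.join('testing to write a file' for i in xrange(5000000)) + '\n')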

$ uname -srvmpio
Linux 3.2.0-26-generic #41-Ubuntu SMP Thu Jun 14 17:49:24 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
$ python  --version
Python 2.7.3
$ pypy --version
Python 2.7.2 (1.8+dfsg-2, Feb 19 2012, 19:18:08)
[PyPy 1.8.0 with GCC 4.6.2]
John La Rooy
  • 295,403
  • 53
  • 369
  • 502
  • When writing the data as a contiguous chunk, I'm getting much better performance in PyPy: the difference is about 4x, instead of 60x. Still significantly slower, though. I suppose I'll just stay away from writelines() and write the entire thing in one go. – Simon Lundberg Sep 26 '12 at 12:02
0

xrange is the answer for this example, as it doesn't build a list up front but yields values lazily, like a generator. 64-bit Python is probably faster than 32-bit PyPy at generating a list with 5 million items.
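
A minimal sketch of that substitution, applied to the original example (Python 2 syntax):

with file(path, 'w') as f:
    # xrange avoids materialising the 5-million-item index list;
    # the list comprehension still builds the list of lines
    f.writelines(['testing to write a file\n' for i in xrange(5000000)])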

If you have other code, post the actual code, not just a test.

unddoch
  • 5,790
  • 1
  • 24
  • 37
  • That doesn't explain the observation at all, as it's *at least* as true for CPython as for PyPy. Possibly PyPy even benefits (compared to CPython) from the use of `range`, as some versions include an optimization where the list is not actually materialized unless needed. (Also see the answer by Matthew Trevor and the comments.) –  Sep 25 '12 at 15:31
0

Let's first get your benchmarking method straight.

When the goal is to measure pure file-writing performance, it is a major flaw (a systematic error) to create the data to be written within the code segment that you are timing, because data creation also takes time that you do not want to measure.

Hence, if you plan to keep the whole dummy data in memory, create it before measuring the time.
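
A minimal sketch of that separation (Python 2), so that only the I/O is measured:

import time

path = 'tst'  # placeholder output path
data = ['testing to write a file\n' for _ in xrange(5000000)]  # created before timing

start = time.time()
with file(path, 'w') as f:
    f.writelines(data)
print 'write took %.3f s' % (time.time() - start)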

However, in your case, on-the-fly data generation is likely to be faster than your I/O will ever be. So by using a Python generator, in this case a generator expression, in combination with the write call, you get rid of this systematic error.

I don't know how writelines performs compared to write, but taking your writelines example:

with file(path, 'w') as f:
    f.writelines('xxxx\n' for _ in xrange(10**6))

Writing large chunks of data with write might be faster:

with file(path, 'w') as f:
    for chunk in ('x'*99999 for _ in xrange(10**3)):
        f.write(chunk)

Once you get the benchmarking right, I am pretty sure you will still find differences between Python and PyPy; maybe PyPy is even significantly slower under some circumstances. However, with proper benchmarking I believe you will manage to find the conditions under which PyPy's file writing is fast enough for your purposes.

Dr. Jan-Philip Gehrcke
  • 33,287
  • 14
  • 85
  • 130
  • I'm well aware that the benchmark included the time it took to create the dummy data. However, since writelines() wrote so incredibly slow in PyPy, it became a very marginal difference. When generating a list of strings with a list comprehension and writing them out, PyPy spends more than 300 times longer on the file writing than on the data generation. I'll try writing out chunks instead, and see if that's faster. Thanks! – Simon Lundberg Sep 26 '12 at 12:15
-1

You're generating two lists here, one with range and one with the list comprehension.

List 1: one option is to replace the list-returning range with xrange, which yields values lazily. Another is to try PyPy's own optimisation called range-lists.

You can enable this feature with the --objspace-std-withrangelist option.

List 2: you're creating your output list before writing it. This should also be a generator, so turn the list comprehension into a generator expression:

f.writelines('testing to write a file\n' for i in range(5000000))

As long as a generator expression is the only argument passed to a function, it's not even necessary to double up on the parentheses.
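
For example, both of the following are accepted; the explicit parentheses are only required when the generator expression is not the sole argument:

f.writelines('testing to write a file\n' for i in xrange(5000000))
f.writelines(('testing to write a file\n' for i in xrange(5000000)))  # equivalent, with explicit parens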

Matthew Trevor
  • 14,354
  • 6
  • 37
  • 50