
Python 3 appears to be significantly slower than Python 2 when enumerating over a range in a minimal loop, and the gap seems to be getting worse with newer versions of Python 3.

I have Python 2.7.6, Python 3.3.3, and Python 3.4.0 installed on my 64-bit Windows machine (Intel i7-2700K, 3.5 GHz), with both 32-bit and 64-bit builds of each version installed. While there is no significant difference in execution speed between the 32-bit and 64-bit builds of a given version (within their respective limits on memory access), there is a very significant difference between version levels. I'll let the timing results speak for themselves:

C:\**Python34_64**\python -mtimeit -n 5 -r 2 -s"cnt = 0" "for i in range(10000000): cnt += 1"
5 loops, best of 2: **900 msec** per loop

C:\**Python33_64**\python -mtimeit -n 5 -r 2 -s"cnt = 0" "for i in range(10000000): cnt += 1"
5 loops, best of 2: **820 msec** per loop

C:\**Python27_64**\python -mtimeit -n 5 -r 2 -s"cnt = 0" "for i in range(10000000): cnt += 1"
5 loops, best of 2: **480 msec** per loop

Since Python 3's "range" is not the same as Python 2's "range" but is functionally equivalent to Python 2's "xrange", I also timed that:

C:\**Python27_64**\python -mtimeit -n 5 -r 2 -s"cnt = 0" "for i in **xrange**(10000000): cnt += 1"
5 loops, best of 2: **320 msec** per loop
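
As a quick illustration of that equivalence (an interpreter session for illustration only), Python 3's "range" is a lazy sequence just as "xrange" was:

>>> r = range(10000000)
>>> r[5000000]      # indexed lazily, without building a ten-million-element list
5000000
>>> len(r)
10000000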

One can easily see that version 3.3 is almost twice as slow as version 2.7, and Python 3.4 is about 10% slower again than that.

My question: Is there an environment option or setting that corrects this, or is it just inefficient code or the interpreter doing more for the Python 3 version?


The answer seems to be that Python 3 uses the "infinite precision" integers (what Python 2.x called "long") as its default "int" type, with no option to use Python 2's fixed bit-length "int"; it is the processing of these variable-length "int"s that takes the extra time, as discussed in the answers and comments below.
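
This can be seen directly at the interpreter prompts (an illustrative session; the exact promotion point in Python 2 depends on the platform's `sys.maxint`). In Python 2:

>>> import sys
>>> type(sys.maxint)
<type 'int'>
>>> type(sys.maxint + 1)
<type 'long'>

In Python 3, `sys.maxint` is gone and there is only the one arbitrary-length type:

>>> type(2**100)
<class 'int'>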

It may be that Python 3.4 is somewhat slower than Python 3.3 because of changes to memory allocation made to support synchronization, which slightly slow memory allocation/deallocation; since "long" processing is dominated by such allocations, that is likely the main reason the current version runs slower.

GordonBGood
  • At least because in Python 3 the default type for integer is `long long` – vaultah May 04 '14 at 06:01
  • I concur with frostnational. Try to change `cnt = 0` to `cnt = 0L` in the python2 benchmarks and see if this makes a difference. I just tried it on my machine and python3 is twice as fast compared to python2's `range` and 30% *faster* than Python 2's `xrange` (at least on my box). – Bakuriu May 04 '14 at 06:47
  • @frostnational, no, it's worse than "long long" or 64 bits: these are "infinite precision" integers with as many bytes/words as necessary to represent a given value. There doesn't seem to be a way to get the speed back other than to wait for Python to optimize the implementation, if possible. – GordonBGood May 04 '14 at 10:59

2 Answers


A summary answer of what I've learned from this question might be of help to others who wonder the same things as I did:

  1. The reason for the slowdown is that all integer variables in Python 3.x are now the "infinite precision" type that used to be called "long" in Python 2.x; it is now the only integer type, as decided by PEP 237. Per that document, "short" integers with the bit-depth of the base architecture no longer exist (except internally).

  2. The old "short" integer operations could run reasonably fast because they used the underlying machine-code operations directly, and allocation of new "int" objects was optimized because they always had the same size.

  3. The "long" type is currently only represented by a class object allocated in memory as it could exceed a given fixed length register/memory location's bit-depth; since these object representations could grow or shrink for various operations and thus have a variable size, they cannot be given a fixed memory allocation and left there.

  4. These "long" types (currently) don't use a full machine architecture word size but reserve a bit (normally the sign bit) to do overflow checks, thus the "infinite precision long" is divided (currently) into 15-bit/30-bit slice "digits" for 32-bit/64-bit architectures, respectively.

  5. Many common uses of these "long" integers won't require more than one "digit" (or maybe two on 32-bit architectures), as the range of one "digit" is about one billion/32768 for 64-bit/32-bit architectures, respectively.

  6. The 'C' code is reasonably efficient at one- or two-"digit" operations, so for many common uses the computation cost over the simpler "short" integers isn't all that high compared with the time required to run the byte-code interpreter loop.

  7. The biggest performance hit is likely the constant memory allocations/deallocations, one pair per integer operation in the loop, which is quite expensive, especially as Python moves to support multi-threading with synchronization locks (which is likely why Python 3.4 is worse than 3.3).

  8. Currently, the implementation always ensures sufficient "digits" by allocating one extra "digit" above the size of the biggest operand whenever the result might "grow", doing the operation (which may or may not actually use that extra "digit"), and then normalizing the result length to the number of "digits" actually used. The length may stay the same or "shrink" for some operations; this is done by just reducing the size count in the "long" structure without a new allocation, so it may waste one "digit" of memory space but saves the performance cost of yet another allocation/deallocation cycle.

  9. There is hope for a performance improvement: for many operations it is possible to predict whether the result will "grow". For instance, an addition cannot grow if the Most Significant Bits (MSBs) of both operands are zero, which will be the case for many loop/counter operations; whether a subtraction "grows" depends on the signs and MSBs of the two operands; a left shift will only "grow" if the MSB is a one; et cetera.

  10. For statements such as "cnt += 1" or "i += step" (which open the possibility of in-place operations for many use cases), an "in-place" version of the operation could be called which would do the appropriate quick checks and only allocate a new object if a "grow" were necessary, otherwise doing the operation in place on the first operand. The complication is that the compiler would need to produce these "in-place" byte codes; however, that has already been done, with appropriate special "in-place operation" byte codes produced. The current byte-code interpreter simply directs them to the usual version as described above because the in-place slots have not yet been implemented (they are zeroed/null values in the method table).

  11. It may well be that all that has to be done is to write versions of these "in-place operations" and fill them into the "long" method table, with the byte-code interpreter already finding and running them if they exist, or at most minor changes to a table to make it call them being all that is required.
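
As a rough sketch of points 3 to 5 above, `sys.getsizeof` shows a "long" growing one "digit" at a time (the byte counts assume a typical 64-bit CPython build with 30-bit digits stored 4 bytes each; other builds will differ):

>>> import sys
>>> sys.getsizeof(1)         # object header plus one "digit"
28
>>> sys.getsizeof(2**30)     # the value now needs a second "digit"
32
>>> sys.getsizeof(2**60)     # and a third
36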

Note that floats are always the same size, so the same improvements could be made for them, although floats are already allocated in blocks of spare locations for better efficiency; it would be much harder to do that for "long"s, as they take a variable amount of memory.
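
That block allocation can be glimpsed through object identity (an illustrative session; the slot reuse shown is typical CPython behaviour, not a guarantee):

>>> x = 3.14
>>> addr = id(x)
>>> del x              # the fixed-size slot goes back onto the float free list
>>> y = 2.71
>>> id(y) == addr      # often True: y is handed the slot x just vacated
True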

Also note that this would break the immutability of "long"s (and optionally floats), which is why no in-place operators are defined; but treating them as mutable only for these special cases doesn't affect the outside world, which would never realize that a given object sometimes has the same address as the old value (as long as equality comparisons look at contents and not just object addresses).
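
For example, equality already compares contents rather than addresses (illustrative; note that CPython caches small ints, so tiny values can happen to share an address anyway):

>>> a = 1000000
>>> b = 10**6
>>> a == b         # compares values
True
>>> a is b         # compares object addresses; usually False for larger ints
False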

I believe that by avoiding the memory allocation/deallocation for these common use cases, the performance of Python 3.x would come quite close to that of Python 2.7.

Much of what I've learned here comes from the Python trunk 'C' source file for the "long" object, Objects/longobject.c.


EDIT_ADD: Whoops, I forgot that if variables are sometimes mutable, then closures over local variables don't work, or at least not without major changes, meaning that the above in-place operations would "break" closures. It would seem that a better solution would be to get advance allocation of spare space working for "long"s, just as it used to for short integers and does for floats, even if only for the cases where the "long" size doesn't change (which covers the majority of the time, such as for loops and counters as per the question). Doing this should mean that the code wouldn't run that much slower than Python 2 for typical use.
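
A minimal sketch of the closure problem (illustrative, Python 3 syntax): closures depend on "cnt += 1" rebinding the closed-over name to a new object rather than mutating the old one:

>>> def make_counter():
...     cnt = 0
...     def bump():
...         nonlocal cnt
...         cnt += 1       # today this rebinds the closure cell to a NEW int object
...         return cnt
...     return bump
...
>>> c = make_counter()
>>> first = c()            # keeps a reference to the int 1
>>> c()
2
>>> first                  # still 1; a naive in-place += would have changed it to 2
1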

GordonBGood

The difference is due to the replacement of the int type with the long type. Obviously operations with long integers are going to be slower because the long operations are more complex.

If you force python2 to use longs by setting cnt to 0L the difference goes away:

$python2 -mtimeit -n5 -r2 -s"cnt=0L" "for i in range(10000000): cnt += 1L"
5 loops, best of 2: 1.1 sec per loop
$python3 -mtimeit -n5 -r2 -s"cnt=0" "for i in range(10000000): cnt += 1"
5 loops, best of 2: 686 msec per loop
$python2 -mtimeit -n5 -r2 -s"cnt=0L" "for i in xrange(10000000): cnt += 1L"
5 loops, best of 2: 714 msec per loop

As you can see, on my machine python3.4 is faster than python2 with both range and xrange when using longs. The last benchmark, with Python 2's xrange, shows that the difference in this case is minimal.

I don't have python3.3 installed, so I cannot compare 3.3 and 3.4, but as far as I know nothing significant changed between these two versions regarding range, so the timings should be about the same. If you see a significant difference, try to inspect the generated bytecode using the dis module. There was a change to the memory allocators (PEP 445), but I have no idea whether the default allocators were modified or what the performance consequences were.
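
For example (illustrative; the exact opcodes differ between versions, which is what such a comparison would reveal):

$python3 -c "import dis; dis.dis(compile('for i in range(10000000): cnt += 1', '<timeit>', 'exec'))"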

Bakuriu
  • You've explained why Python 3 is slower using longs, thank you, and you've shown how to make Python 2 almost as slow as Python 3.3 and not quite as slow as 3.4 on my machine (it would likely be even slower if one could force the xrange to longs as well); however, you haven't given me a way to make Python 3 as fast as Python 2 when one doesn't need the range of longs. I'm guessing that there isn't any way, as Python has dropped 32-bit integers. That seems to be an oversight, as in no concern for performance? For simple counting there are built-in methods, but not for iterators/generators. – GordonBGood May 04 '14 at 07:38
  • @GordonBGood: Good question. Anyway, performance is probably not the most important goal of the Python development. There are always tradeoffs. I can imagine that Python 3 will add the optimization later. – pepr May 04 '14 at 10:10
  • @pepr, Bakuriu got me thinking and reading, and I now see that Python 3's "int" isn't just the 'C' "long long" type (64 bits); rather, both Python 2's "long" and Python 3's "int" (the same as Python 2's "long") are infinite precision integers with as many bits/bytes as needed to represent the value. The change is the dropping of the usual fixed-length integers to support only these infinite precision integers. It well explains why enumerating over Python 3's ranges is so slow, and it will get slower the larger the values used. Yes, these should be able to be optimized for smallish ranges like this. – GordonBGood May 04 '14 at 10:54
  • @GordonBGood Anyway, keep in mind that *any* operation between python objects will have much more overhead than "raw" operations in C. For example, even in python2 `1 + 2` the `+` is a full blown method call. Also even the "short integers" actually used more memory than a simple C integer: *every* python object must *at least* have a pointer to its class and the reference count, which means a 4 byte short integer will probably use about 12 bytes of memory. From this point of view operations on `long`s are not that much slower, although there is a significant difference. – Bakuriu May 04 '14 at 11:23
  • Understood on the memory use that these "long" integers only use more memory when they need it for extra precision; my main concern is the performance hit. It already is optimized pretty well considering that an equivalent loop in F# using infinite precision BigInt's for both the loop index and the count variables takes about 2.3 seconds or a little under three times as slow. I also see the point in eliminating the extra type, but lament the cost in execution speed of almost three times (at least on my machine): [PEP 237](http://legacy.python.org/dev/peps/pep-0237/) – GordonBGood May 04 '14 at 14:27
  • This answer is 6+ years old and I'm wondering if it still holds true for Python `2.7.12` versus `3.5` which are currently installed in Ubuntu 16.04? I repeated your examples and it does seem true still today. – WinEunuuchs2Unix Mar 14 '21 at 01:05
  • @WinEunuuchs2Unix Ubuntu 16.04 is old, python3.5 is also old and python 2.7.12 is also old. On my Ubuntu 20.04 machine with python 2.7.18 and python3.8.5 I see python3 is faster than python2 in all cases: `python2 -mtimeit -n5 -r2 -s"cnt=0" "for i in range(10000000): cnt += 1" 5 loops, best of 2: 300 msec per loop` while python3: `python3 -mtimeit -n5 -r2 -s"cnt=0" "for i in range(10000000): cnt += 1" 5 loops, best of 2: 285 msec per loop` Maybe the difference can be attributed to 32 vs 64 bit? In 2014 I was using a 32bit laptop I believe, now I have a much better 64bit desktop. – Bakuriu Mar 14 '21 at 09:21
  • @Bakuriu Yes I have 20.04 upgrade on my radar. My python programs contain header stub `/usr/bin/env python` so I don't actually specify python 2 or 3. I assume on Ubuntu 20.04 it will default to Python 3.8.5 automatically and not Python 2.7.18? My python code should already be written for compatibility with both versions though. EG it tests version before executing different code. My 32 bit machines are long gone too now. Everything is 64 bit, except perhaps smart wall power plugs. – WinEunuuchs2Unix Mar 14 '21 at 20:12