
(Update: Might only happen in CPython 3.8 32-bit for Windows, so don't be surprised if you can't reproduce it in other versions. See tables in the Update section.)

Both iter and reversed result in specialized iterators for lists:

>>> iter([1, 2, 3])
<list_iterator object at 0x031495C8>

>>> reversed([1, 2, 3])
<list_reverseiterator object at 0x03168310>

But the reversed one is much slower:

> python -m timeit -s "a = list(range(1000))" "list(iter(a))"
50000 loops, best of 5: 5.76 usec per loop

> python -m timeit -s "a = list(range(1000))" "list(reversed(a))"
20000 loops, best of 5: 14.2 usec per loop

And I can consistently reproduce it. I later ran iter five more times, taking 5.98, 5.84, 5.85, 5.87, 5.86 usec, then reversed five more times, taking 14.3, 14.4, 14.4, 14.5, 14.3 usec.

I thought maybe iter benefits from the list's elements being laid out at increasing memory addresses, so I tried reversing the list beforehand. Same picture:

> python -m timeit -s "a = list(range(1000)); a.reverse()" "list(iter(a))"
50000 loops, best of 5: 5.73 usec per loop

> python -m timeit -s "a = list(range(1000)); a.reverse()" "list(reversed(a))"
20000 loops, best of 5: 14.1 usec per loop

Same picture with sum as well:

> python -m timeit -s "a = list(range(1000))" "sum(iter(a))"
20000 loops, best of 5: 10.7 usec per loop

> python -m timeit -s "a = list(range(1000))" "sum(reversed(a))"
10000 loops, best of 5: 20.9 usec per loop

And with identical elements, too:

> python -m timeit -s "a = [None] * 1000" "list(iter(a))"
50000 loops, best of 5: 6.35 usec per loop

> python -m timeit -s "a = [None] * 1000" "list(reversed(a))"
20000 loops, best of 5: 14.5 usec per loop

Why is the reverse iterator so much slower?

I'm using CPython 3.8.1 32-bit on Windows 10 Pro 64-bit version 1903 with an Intel i5-7200U (it's a HUAWEI MateBook X). No special configuration, just a normal Python install on a normal Windows install.

Update: I ran a larger automated test with eight different Python versions (all freshly installed with default settings) on another machine (Pentium N3700, Windows 10 Pro 64-bit 1903). Times in usec per loop:

                32-bit              64-bit
CPython     iter   reversed     iter   reversed
 3.5.4      16.6     17.0       15.2     16.2
 3.6.8      16.8     17.2       14.9     15.8
 3.7.6      16.5     16.9       14.8     15.5
 3.8.1      16.3     22.1       14.6     15.5

Two things to note:

  1. Python 3.8.1 32-bit is the only version where reversed is much slower. That might explain why almost nobody else could reproduce it.
  2. In all seven other versions, reversed was only a bit slower than iter: about 0.4 usec in 32-bit and about 0.9 usec in 64-bit.

I ran those 16 tests in round-robin fashion for ten rounds, and each time shown above is the best of its ten source times. Each of the 160 source times came from a command like this:

python.exe -m timeit -r 5 -s "a = list(range(1000))" "list(iter(a))"
or
python.exe -m timeit -r 5 -s "a = list(range(1000))" "list(reversed(a))"

The times for each of the 16 tests were pretty consistent. Full table (note that the round-robin means I ran these column by column, not row by row):

3.5.4 32-bit iter     [16.7, 16.6, 17.3, 16.6, 16.7, 16.6, 16.6, 16.6, 16.6, 16.7]
3.5.4 32-bit reversed [17.1, 17.1, 17.1, 17.2, 17.1, 17.1, 17.0, 17.1, 17.1, 17.1]
3.5.4 64-bit iter     [15.2, 15.4, 15.4, 15.4, 15.4, 15.4, 15.4, 15.3, 15.4, 15.3]
3.5.4 64-bit reversed [16.8, 16.2, 16.3, 16.3, 16.2, 16.2, 16.2, 16.2, 16.2, 16.3]
3.6.8 32-bit iter     [17.3, 16.9, 16.8, 16.9, 16.9, 16.8, 16.9, 16.9, 16.8, 16.8]
3.6.8 32-bit reversed [17.2, 17.2, 17.2, 17.3, 17.3, 17.3, 17.3, 17.2, 17.2, 17.2]
3.6.8 64-bit iter     [15.0, 14.9, 15.9, 14.9, 14.9, 15.0, 14.9, 14.9, 14.9, 14.9]
3.6.8 64-bit reversed [15.8, 15.9, 16.4, 15.9, 15.9, 16.0, 15.8, 15.9, 15.9, 15.8]
3.7.6 32-bit iter     [16.6, 17.2, 16.6, 16.5, 16.7, 16.7, 16.5, 16.5, 16.5, 16.7]
3.7.6 32-bit reversed [17.2, 17.6, 17.0, 17.0, 16.9, 17.2, 17.3, 17.0, 17.5, 17.0]
3.7.6 64-bit iter     [14.8, 15.1, 14.9, 14.9, 14.8, 15.1, 14.9, 14.8, 15.0, 14.9]
3.7.6 64-bit reversed [16.0, 20.1, 15.7, 15.6, 15.6, 15.6, 15.7, 15.7, 15.8, 15.5]
3.8.1 32-bit iter     [16.4, 16.6, 16.3, 16.4, 16.5, 16.4, 16.5, 16.4, 16.8, 16.4]
3.8.1 32-bit reversed [22.3, 22.4, 22.2, 22.3, 22.3, 22.3, 22.5, 22.4, 22.3, 22.1]
3.8.1 64-bit iter     [14.6, 15.1, 14.6, 14.7, 14.7, 14.7, 14.7, 14.6, 14.6, 14.6]
3.8.1 64-bit reversed [15.5, 16.1, 15.5, 15.6, 15.5, 15.5, 15.5, 15.5, 15.5, 15.5]

The same test on a list with a million values (list(range(250)) * 4000). Times are in msec per loop:

                32-bit              64-bit
CPython     iter   reversed     iter   reversed
 3.5.4      19.8     19.9       22.4     22.7
 3.6.8      19.8     19.9       22.3     22.6
 3.7.6      19.9     19.9       22.3     22.5
 3.8.1      19.8     24.9       22.4     22.6

The variation is even smaller, except reversed on 3.8.1 32-bit is much slower again.

One more run, this time with CPython 3.8.0 instead of 3.8.1, where it also happens:

                32-bit              64-bit
CPython     iter   reversed     iter   reversed
 3.5.4      19.5     19.6       21.9     22.2
 3.6.8      19.5     19.7       21.8     22.1
 3.7.6      19.5     19.6       21.7     22.0
 3.8.0      19.4     24.5       21.7     22.1
Kelly Bundy
  • I can't reproduce this. I get a touch under 3.5 µsec per loop in both cases. Python 3.8.1 on Ubuntu via WSL on Windows 10. And a touch under 4 µsec per loop in both cases with Python 3.7.4 on Windows 10 via Anaconda. – ChrisGPT was on strike Jan 31 '20 at 13:59
  • I get pretty similar numbers on the first example: 3.55/3.63 ... using Debian, though. – garglblarg Jan 31 '20 at 14:01
  • Same, I've got similar numbers on all of them, using Windows 10. – zamir Jan 31 '20 at 14:04
  • @Chris For me it's consistent. I tried `iter` five more times, took 5.98, 5.84, 5.85, 5.87, 5.86. Then `reversed` five more times, took 14.3, 14.4, 14.4, 14.5, 14.3. – Kelly Bundy Jan 31 '20 at 14:12
  • @Chris So closing this because "can no longer be reproduced" is simply wrong. – Kelly Bundy Jan 31 '20 at 14:13
  • @HeapOverflow, _you_ may be able to reproduce it, but if _we_ can't there's not much we can do. FWIW, I voted to close as "needs more detail". – ChrisGPT was on strike Jan 31 '20 at 14:16
  • 8.63 µs (iter) vs 8.65 µs (reversed) – kederrac Jan 31 '20 at 14:16
  • @Chris Ok, what kind of detail shall I add? – Kelly Bundy Jan 31 '20 at 14:17
  • @HeapOverflow, I'm not sure. I know this is frustrating; it's frustrating for me too. I'd love to tell you "run command `x` and show me the output"... Can you reproduce on other machines? With other versions of Python? Have you tried in a clean virtualenv? – ChrisGPT was on strike Jan 31 '20 at 14:18
  • @MisterMiyagi You mean the memory locations of the *references* to the list's elements, right? I mean the memory locations of the *elements* themselves. Like in [this question](https://stackoverflow.com/q/42107442/12671057), where the order of the elements mattered. – Kelly Bundy Jan 31 '20 at 14:25
  • @Chris I added a few more details on the bottom. I don't have other machines right now and I haven't tried other Python versions (kinda don't want to install others just for this). I don't know about virtualenv. – Kelly Bundy Jan 31 '20 at 14:32
  • @MisterMiyagi No, as you can see in that linked question and its answers, the elements *do* matter. It doesn't just copy references, but also increases the reference counters in all elements. – Kelly Bundy Jan 31 '20 at 14:34
  • @HeapOverflow, this probably isn't relevant, but is there a reason you're using 32-bit Python on a 64-bit OS? – ChrisGPT was on strike Jan 31 '20 at 14:40
  • @Chris That's just python.org's default (at least for Windows/me). I didn't even know they have a 64-bit version there until I went searching for it now. Unless I have a good reason to override the default, I don't. – Kelly Bundy Jan 31 '20 at 14:45
  • @HeapOverflow The webpage cannot know if your computer is 64-bit or not. It is definitely not the "default" in the sense you mean. Anyway, it should not make a big difference. – Acorn Jan 31 '20 at 14:49
  • "*kinda don't want to install others just for this*" If you are the only one that can reproduce it yet you don't want to do the work, don't expect others to do it for you. – Acorn Jan 31 '20 at 14:51
  • @Acorn My browser is 64-bit and identifies itself as such in the user agent string. I believe this can be used and *is* used at least by some software/sites to offer me the proper 32 or 64 bit version. No? – Kelly Bundy Jan 31 '20 at 14:53
  • @HeapOverflow Some browsers give it, yes, but not all, and many users run extensions to avoid fingerprinting anyway. – Acorn Jan 31 '20 at 14:58
  • @Acorn For me, that "do the work" would be to install another (maybe several) Python version, then clean up again and hope I didn't affect my current configuration. For others, it's running two quick timeit commands. I think those works are not equivalent and it's reasonable to first wait some more. I might go through the hassle later. – Kelly Bundy Jan 31 '20 at 15:08
  • @HeapOverflow My point is that several others have tried and already told you they cannot reproduce it, so *someone* will have to install several Python versions to solve it. In any case, if I may suggest, I would recommend working on improving your workflow if installing new versions is something that can break your working environment. – Acorn Jan 31 '20 at 15:16
  • @Acorn Maybe, maybe not. Maybe someone else will be able to reproduce it right away and maybe someone will even be able to explain it right away. – Kelly Bundy Jan 31 '20 at 15:24
  • I just tried Python 3.8.0 installed via `pyenv` on Arch Linux. Still can't reproduce it. – ChrisGPT was on strike Jan 31 '20 at 18:02
  • @Chris Thanks. And I remembered I do have another PC here and just tested it. Can reproduce it there as well, see bottom of the question. Will try 64-bit Python on it next. – Kelly Bundy Jan 31 '20 at 22:29
  • @Chris As just added in a second update, there still appears to be a difference in the 64-bit version as well, but not much. – Kelly Bundy Jan 31 '20 at 22:52
  • @Acorn I just tried a different machine and same+different Python. See updates on the bottom of the question. – Kelly Bundy Jan 31 '20 at 22:54
  • @HeapOverflow, I've now managed to reproduce this with 32 bit Python on 64 bit Windows, but I still don't see any discrepancy on Linux, even with 32 bit Python. I've also confirmed that the 64-bit installer for Windows isn't very easy to find while the 32-bit installer is displayed prominently, suggesting that most people who download Python for Windows are likely to get the 32-bit version. – ChrisGPT was on strike Feb 01 '20 at 13:28
  • @Chris Thanks for all that! What times do you get for 32 bit Python on 64 bit Windows, i.e., how much do they differ? I wonder whether it's a genuine 32 bit issue or just the 32 bit version was compiled with fewer optimizations. Especially now that you can't reproduce it with 32 bit Python on Linux. – Kelly Bundy Feb 01 '20 at 13:44
  • @HeapOverflow, I did it on a throwaway VM from modern.ie, but I think the times were about 3× different. Something like 5 µs to 15 µs for the `range()` / `list()` example. – ChrisGPT was on strike Feb 01 '20 at 14:14
  • You might want to get in touch with whoever does the Windows builds of Python. Maybe they did something weird with this particular build. I don't see anything in the source that could be causing the problem. Aside from that, you might have to disassemble the executable and read up on stuff like prefetching behavior. – user2357112 Feb 01 '20 at 20:16
  • [The Windows installers are handled by Steve Dower](https://www.python.org/dev/peps/pep-0569/), apparently. It might be best to ask on one of the mailing lists (probably python-dev) and/or file a bug report before trying to reach out to Steve directly. – user2357112 Feb 01 '20 at 20:28
  • @user2357112supportsMonica Thanks. I'll probably file a bug report. Doing some more tests now. – Kelly Bundy Feb 01 '20 at 20:37
  • Hypothesis: your versions of Python are compiled with different levels of optimization. – Mad Physicist May 21 '22 at 20:30
  • @MadPhysicist Maybe. I just used the installers from python.org. – Kelly Bundy May 21 '22 at 20:33

2 Answers


Unable to reproduce a significant difference

I've tried this on Python 3.9 and Python 3.11 using the stock macOS builds from python.org. The timings show that forward and reverse iteration run at close to the same speed:

% python3.11 -m timeit -s "a = list(range(1000))" "list(iter(a))"
100000 loops, best of 5: 2.84 usec per loop
% python3.11 -m timeit -s "a = list(range(1000))" "list(reversed(a))"
100000 loops, best of 5: 2.85 usec per loop

% python3.9 -m timeit -s "a = list(range(1000))" "list(iter(a))"
100000 loops, best of 5: 2.87 usec per loop
% python3.9 -m timeit -s "a = list(range(1000))" "list(reversed(a))"
100000 loops, best of 5: 2.91 usec per loop

Source code

Examining the C source shows that the code for list_iterator and list_reverseiterator is almost identical.

The expectation is that the reversed iterator will be very slightly slower for two reasons:

  1. The loop termination logic for reversed checks two conditions, index >= 0 && index < PyList_GET_SIZE(seq), instead of the single condition for forward iteration, it->it_index < PyList_GET_SIZE(seq).

  2. The memory access controllers on some CPUs issue automatic prefetches when traversing an array in the forward direction but not in reverse. See §12.1 "Optimizing Caching" in Agner Fog's "Optimizing subroutines in assembly language".
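In pure Python, the difference between the two termination checks can be sketched roughly like this (a simplified model of the logic only, not the actual C implementation):

```python
# Simplified Python model of the two iterators' termination logic
# (assumption: this mirrors the C code in spirit, not in detail).

def forward_iter(seq):
    i = 0
    while i < len(seq):          # one condition, like it->it_index < PyList_GET_SIZE(seq)
        yield seq[i]
        i += 1

def reverse_iter(seq):
    i = len(seq) - 1
    while 0 <= i < len(seq):     # two conditions, like index >= 0 && index < PyList_GET_SIZE(seq)
        yield seq[i]
        i -= 1

print(list(forward_iter([1, 2, 3])))   # [1, 2, 3]
print(list(reverse_iter([1, 2, 3])))   # [3, 2, 1]
```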

Sources of Variability

Given that the source code is almost identical, the variability is likely due to the compilation and build process.

  1. Perhaps the Windows compiler can better optimize forward iteration with its simpler loop termination condition.

  2. A more likely cause is that the Windows build is using Profile Guided Optimization (PGO) to inform the code generation step. This is very sensitive to how the profile is built. If the profiled code made little or no use of the reverse iterator, then it will not benefit from PGO.

The usual recommendation here is to rerun PGO using code that you care about rather than just the test suite. That will tune the build to your specific needs.
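For example (a sketch; --enable-optimizations, PROFILE_TASK, and the --pgo flag are CPython's standard build hooks, but the workload module name here is hypothetical):

```shell
# On Unix: rebuild CPython with PGO driven by your own workload
./configure --enable-optimizations
make PROFILE_TASK="-m my_benchmark_workload"   # hypothetical module name

# On Windows: use the PCbuild scripts in a CPython source checkout
PCbuild\build.bat --pgo
```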

Better benchmark

Consecutive integers created by range() tend to be small objects and laid out in consecutive memory locations.

Also, the timing for a small number of iterations tends to include the overhead of the calls to list, iter, and reversed.

For a better benchmark, I would use a larger, well-shuffled dataset of more interesting objects. Also, I would leave list() out of the timed statement so the timing focuses on the iteration itself rather than on building a list.

from timeit import repeat

setup = '''
from random import randrange, shuffle
data = [str(randrange(100_000)) for i in range(10_000)]
shuffle(data)
'''

forward = 'for x in iter(data): pass'

backward = 'for x in reversed(data): pass'

Here are two runs:

>>> min(repeat(forward, setup, repeat=7, number=100_000))
4.0327267500106245

>>> min(repeat(backward, setup, repeat=7, number=100_000))
4.094246666994877
Raymond Hettinger
  • I see no significant difference on Linux either (Python 3.8.10; system Python on Linux Mint 20.3). – Karl Knechtel May 21 '22 at 19:53
  • Yes, reverse being slightly slower is what I'd expect as well, and the table at the bottom of my question shows it was the case for 7 out of the 8 Python versions I tested. Only in one of them was reversed *significantly* slower (the one where I originally noticed this). Regarding better benchmark: I don't remember, but might've used unshuffled `range` *intentionally*, in order to keep the overhead low. And like I said, I did try reversing the list beforehand to remove the reverse iteration's potential disadvantage of visiting the objects (or rather their reference counts) in reverse ... – Kelly Bundy May 21 '22 at 21:20
  • ... memory order, and it made no difference. The size was also sufficiently large to clearly show the significant speed difference, so I see no reason to change that, either. What I *would* do differently today is to consume with `deque` with `maxlen=0`, to further reduce the overhead and get as close to pure iteration as I can get. And perhaps I'd use `[0, 1, 2, 3] * 250` instead, to further reduce overhead. Your larger perhaps cache-exceeding data, shuffling, and `for` loops all just add overhead that dilutes potential speed differences of the pure iteration. – Kelly Bundy May 21 '22 at 21:20
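The deque-consumption trick mentioned in the comment above can be sketched like this (my own variation, not code from the question; timings vary by machine):

```python
# Sketch: consume the iterator with a zero-length deque, which discards
# items in C and avoids the cost of building a result list, so the timing
# is closer to the cost of pure iteration.
from timeit import repeat

setup = "from collections import deque; a = [0, 1, 2, 3] * 250"
forward = "deque(iter(a), maxlen=0)"
backward = "deque(reversed(a), maxlen=0)"

# Small repeat/number values keep this sketch quick; increase them
# for stable measurements.
t_fwd = min(repeat(forward, setup, repeat=5, number=1000))
t_bwd = min(repeat(backward, setup, repeat=5, number=1000))
print(f"forward:  {t_fwd:.4f} s")
print(f"backward: {t_bwd:.4f} s")
```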

I've been looking at the Python documentation and found this:

If the __reversed__() method is not provided, the reversed() built-in will fall back to using the sequence protocol (__len__() and __getitem__()). Objects that support the sequence protocol should only provide __reversed__() if they can provide an implementation that is more efficient than the one provided by reversed().

Probably it is because of that.
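For illustration, the fallback described in the quoted documentation can be seen with a small custom sequence (the class name is made up; note that built-in lists do define __reversed__, so they take the specialized path instead):

```python
# A minimal sequence (made-up class) without __reversed__: reversed()
# falls back to the sequence protocol (__len__ plus __getitem__).
class Seq:
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, index):
        return self.data[index]

print(list(reversed(Seq([1, 2, 3]))))   # [3, 2, 1], via the fallback
print(type(reversed([1, 2, 3])))        # lists define __reversed__ and
                                        # return a list_reverseiterator
```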

I'm not an expert, but I saw the question had gone 7 months without an answer and tried to give you one based on the Python documentation.

These are the resources I used:

https://docs.python.org/3/library/functions.html#reversed
https://docs.python.org/3/reference/datamodel.html#object

  • Not sure what you mean. Lists do provide a `__reversed__` method. And as I said and showed, both `iter` and `reversed` do result in specialized iterators for lists. – Kelly Bundy Sep 09 '20 at 00:11