
Possibly a duplicate, but I couldn't find anything.

I have a very long iterator (10000 items) and I need to iterate over it ~500 items at a time. So if my iterator were `range(10000)`, it would look like this:

Iteration #1: 0, 1, 2, ... 497, 498, 499
Iteration #2: 1, 2, 3, ... 498, 499, 500
Iteration #3: 2, 3, 4, ... 499, 500, 501
Iteration #4: 3, 4, 5, ... 500, 501, 502
...
Iteration #9500: 9499, 9500, 9501 ... 9996, 9997, 9998
Iteration #9501: 9500, 9501, 9502 ... 9997, 9998, 9999

and so on. There is this method:

def nwise_slice(lst, n):
    for i in range(len(lst) - n + 1):
        yield lst[i:i + n]
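For instance (a standalone sketch; `nwise_slice` is repeated here so the snippet runs on its own, and the input list is just made-up data):

```python
def nwise_slice(lst, n):
    # slice-based version from above: needs len() and slicing
    for i in range(len(lst) - n + 1):
        yield lst[i:i + n]

print(list(nwise_slice([10, 20, 30, 40, 50], 3)))
# [[10, 20, 30], [20, 30, 40], [30, 40, 50]]
```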

However, this doesn't work with lazy iterators. So I adapted the itertools `pairwise` and `consume` recipes (see here) into an iterator-based solution:

import itertools

def nwise_iter(lst, n):
    # n independent copies of the iterator; advance the idx-th copy by idx items
    iters = itertools.tee(lst, n)
    for idx, itr in enumerate(iters):
        next(itertools.islice(itr, idx, idx), None)  # "consume" recipe: skip idx items

    for group in zip(*iters):
        yield group

which does the same (though it yields tuples rather than lists, which does not matter to me), and I believe it avoids creating a lot of unnecessary slices. This solution works on non-sliceable iterators, like files (which I plan to work with). However, the itertools version was about 2× slower:

In [4]: %timeit list(nwise_slice(list(range(10000)), 500))
46.9 ms ± 729 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit list(nwise_iter(list(range(10000)), 500))
102 ms ± 3.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

I don't want to have to load all of my test data into memory to take advantage of the slice method. Is there a more efficient way to pull this off?
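For reference, here is a standalone sketch showing that the tee-based version does accept a plain, non-sliceable iterator (the generator `gen` is a made-up example, not my real data):

```python
import itertools

def nwise_iter(lst, n):
    # tee-based version from above
    iters = itertools.tee(lst, n)
    for idx, itr in enumerate(iters):
        next(itertools.islice(itr, idx, idx), None)  # advance the idx-th copy by idx
    yield from zip(*iters)

gen = (x * x for x in range(6))  # lazy, non-sliceable iterator
print(list(nwise_iter(gen, 3)))
# [(0, 1, 4), (1, 4, 9), (4, 9, 16), (9, 16, 25)]
```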

iz_
  • Check out the `grouper()` function in this [answer](https://stackoverflow.com/a/4356415/355230) of mine. – martineau Jan 20 '19 at 19:52
  • @martineau Thanks. Unfortunately, that iterates `(1, 2, 3), (4, 5, 6)` instead of `(1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6)` – iz_ Jan 20 '19 at 19:53
  • Do you need the iterator’s results to remain valid when the iteration is advanced? Do you need O(1) access by index to each subsequence? – Davis Herring Jan 20 '19 at 20:48
  • @DavisHerring No, all of that is unneeded. I can afford to "mess up" the original iterator. – iz_ Jan 20 '19 at 20:51
  • You can use a sliding window algorithm. Not faster, but more succinct, look at the [`more_itertools` library](https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.windowed) `list(more_itertools.windowed(range(10000), 500))`. – pylang Feb 01 '19 at 19:00

1 Answer

What about using a deque to "memoize" your items?

from collections import deque

def nwise_slice(it, n):
    deq = deque(maxlen=n)  # fixed-size window: oldest item drops out automatically
    for x in it:
        deq.append(x)
        if len(deq) == n:
            yield deq

my_range = range(8)
for sub in nwise_slice(my_range, 5):
    print(sub)
# =>
# deque([0, 1, 2, 3, 4], maxlen=5)
# deque([1, 2, 3, 4, 5], maxlen=5)
# deque([2, 3, 4, 5, 6], maxlen=5)
# deque([3, 4, 5, 6, 7], maxlen=5)
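One caveat worth noting (a hypothetical variant, not part of the answer above): every `yield` hands out the *same* `deque` object, so collecting the results with `list()` gives many references to one window. Yielding a `tuple` snapshot keeps each result valid, at the cost of one copy per window:

```python
from collections import deque

def nwise_copy(it, n):
    deq = deque(maxlen=n)
    for x in it:
        deq.append(x)
        if len(deq) == n:
            yield tuple(deq)  # snapshot: stays valid after the window advances

print(list(nwise_copy(iter(range(6)), 4)))
# [(0, 1, 2, 3), (1, 2, 3, 4), (2, 3, 4, 5)]
```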
Walter Tross
  • Thanks for the answer. Fortunately, it's extremely fast. However, I'm getting incorrect results - it yields the same deque each time. (Also, `lst[0:n]` will fail on an iterator, but I think this can be fixed with `islice`?). – iz_ Jan 20 '19 at 20:58
  • I guess the main problem is that I'm somewhat of a beginner, with Python. Anyway, I added my test code, maybe you are able to figure out whether I'm using it incorrectly, or where the difference between my use case and yours lies. – Walter Tross Jan 20 '19 at 21:24
  • Interesting: iterating over it is fine, but calling `list(nwise_slice(my_range, 5))` breaks it. Hmm... – iz_ Jan 20 '19 at 21:24
  • well, `list()` wants a generator of items, not a generator of iterators like `nwise_slice()` – Walter Tross Jan 20 '19 at 21:29
  • Thanks for the insight. I modified your solution a bit to work for iterators. – iz_ Jan 20 '19 at 21:36
  • Glad it was useful nonetheless! – Walter Tross Jan 20 '19 at 21:38
  • @Tomothy32: Using `list` is useless, since every result is the same object `deq`. That’s why I asked about invalidating the previous results… – Davis Herring Jan 20 '19 at 22:35
  • The change I would suggest is to replace `yield deq` with `yield list(deq)` to return a copy – Dan D. Jan 20 '19 at 23:50
  • @DanD. this doesn't seem to be necessary, unless you want to make sure `deq` is not altered by code you don't control – Walter Tross Jan 21 '19 at 07:53
  • @WalterTross Without it, `list(nwise_slice(x, 5))` would be a list of the same object. This is unexpected. One can get away with linear use of the result, as with `itertools.groupby`, but as with that you have to know that the generator does not return a copy and make one yourself, `[(k, list(g)) for k, g in groupby(...)]`, as the object or iterator is no longer valid in the next iteration. Instead of copying, one could wrap the returned result in an invalidatable proxy that would become invalid when control returns to the generator. – Dan D. Jan 21 '19 at 20:45