-2

I am trying to create a custom function that improves the itertools pairwise function.

Unlike the pairwise function, which returns pairs of items (n=2), I would like to be able to specify the grouping length (n=x). So that I can return groups of items of a specified length. Is this possible?

So when n=2, the groupwise function should be equivalent to the itertools pairwise function. But when n=3, I would expect it to return groups of 3.

This is what I have so far - which is essentially the same as the pairwise function. ie. works for n=2.

Code:

from itertools import tee

my_string = 'abcdef'

def groupwise(iterable, n):
    a = tee(iterable, n)
    next(a[n-1], None)
    return zip(a[0], a[1])

for item in groupwise(my_string, 2):
    print("".join(item))

Output:

ab
bc
cd
de
ef

I am trying to modify the code above to accept n=3, n=4, n=5, ... n=x, but I cannot think of a way to provide the undetermined number of components to the zip function considering that I would like to specify any value of n <= len(my_string)



When n=3 the output should be:

abc
bcd
cde
def
ScottC
  • 3,941
  • 1
  • 6
  • 20
  • Try `zip(*a[:n])`? – Julien Dec 21 '22 at 05:18
  • `zip(*a[:n])` returns: `aab, bbc, ccd, dde, eef`. Unless I have to change something else ? – ScottC Dec 21 '22 at 05:21
  • Before it was added to `itertools`, `pairwise` was part of `more-itertools`. You can still find it there, along with `triplewise` and other windowing functions. Functions `windowed` or `sliding_window` might be the ones you want. https://more-itertools.readthedocs.io/en/stable/api.html#windowing – Stef Dec 21 '22 at 10:39
  • You may find this ```grouper``` from more_itertools performs better than home-grown one. – Daniel Hao Dec 21 '22 at 12:45
  • `grouper` does not quite do what I need it to do. – ScottC Dec 21 '22 at 13:24
  • 1
    Of course it's possible. Why wouldn't it be? And isn't this simply `sliding_window` from Python's own [itertools recipes](https://docs.python.org/3/library/itertools.html#itertools-recipes)? – Kelly Bundy Dec 21 '22 at 13:26
  • Thanks Kelly, yes - it appears that `sliding_window` and `windowed` from `more_itertools` are the functions that I am looking for. I did not realise these functions existed. Hence my efforts. Thank you. – ScottC Dec 21 '22 at 13:32

2 Answers2

3

tee isn't going to scale in this way, unfortunately. Writing pairwise takes one tee call, and if you want to do it for N elements in each group, you'll need N-1 tee calls.

Fortunately, we can do better by just rolling the loop ourselves. What we're going to do is keep track of the last N elements that we've seen ourselves and yield them whenever we have a group that's big enough. To do that, we need a data structure that can efficiently add things to the right and subtract them from the left. collections.deque will do nicely. In fact, it even has a maxlen parameter that will automatically subtract from the left exactly when we need it to.


from collections import deque

def groupwise(iterable, n):
    accum = deque((), n)
    for element in iterable:
        accum.append(element)
        if len(accum) == n:
            yield tuple(accum)

Construct an empty deque (the first constructor argument is the initial elements: an empty tuple) whose capacity is n. When we have more than n elements added to the right side, it will automatically drop the leftmost one. We could do this by hand, but deque will do it for us so we might as well utilize the functionality they gave us.

Then iterate over the iterable. Append each element to the end of the deque (dropping from the left as needed to satisfy the maximum length requirement). Then, if the deque has n elements, we yield it. That makes our function a generator function, so it will produce a new iterator consisting of all of the yielded values.

We make a shallow copy of our deque so we're not just yielding the same (mutable) deque object each time. We also make that shallow copy a tuple, rather than a deque, to be more in line with other itertools functionality.

To call, use

print(list(groupwise(range(10), 3)))
# Output: [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6), (5, 6, 7), (6, 7, 8), (7, 8, 9)]
Silvio Mayolo
  • 62,821
  • 6
  • 74
  • 116
  • 1
    You don't need `if len(accum) == n:` if you start with a full deque instead of an empty one: `iterator = iter(iterable); accum = deque(islice(iterator, n), maxlen=n); yield tuple(accum); for x in iterator: accum.append(x); yield tuple(accum)`. (where `islice` is imported from `itertools`). I haven't timed it, but I suspect that if `len(iterator)` is much larger than `n`, avoiding this test can make it faster. – Stef Dec 21 '22 at 10:45
  • What do you mean with *"isn't going to scale in this way,"*? – Kelly Bundy Dec 21 '22 at 13:33
  • 1
    And do we really *"need a data structure that can efficiently add things to the right and subtract them from the left"*? What good is O(1) update of the deque if you spend O(n) on building a tuple from it anyway? – Kelly Bundy Dec 21 '22 at 13:38
  • @Stef That's buggy, you yield something even if the iterable is shorter than n. – Kelly Bundy Dec 21 '22 at 13:40
  • @KellyBundy Indeed, I should have said `if len(accum) == n: yield tuple(accum)` before the loop. Regarding your comment on the deque: there is no way to avoid the O(n) generation of the tuple anyway. "Efficiently add to the right and remove from the left" is perhaps an overstatement, but you need to store the elements somehow, so why not a deque. The [code for `more_itertools.sliding_window`](https://more-itertools.readthedocs.io/en/stable/_modules/more_itertools/recipes.html#sliding_window) uses a deque too, although I don't know what other different solutions they benchmarked before choosing – Stef Dec 21 '22 at 14:23
  • 2
    @Stef Or just initialize with n-1 instead of n and remove the extra yield. I'm actually surprised that neither the itertools recipe nor more-itertools do that. Sure, deque works, I just disagree particularly with the "need" for its efficient modification here, when that's overshadowed anyway. Using a tuple instead of a deque might even be faster. – Kelly Bundy Dec 21 '22 at 15:05
  • 2
    @Stef [tuple faster than deque](https://tio.run/##rVJLasMwEN3rFNpFat0QJ5sSyBV6gVKMbI/TAf2iyOD08u5IckI/ZFGoQbZ4n5k3kv0lvju7e/ZhnofgDO@c1tBFdPbM0XgXIu/hNALLLEYI0Tl94/CssVvIiAYwXpkAHlRkrIchVRDJqloNFbdyzziV4odc78ZIQlXXjYaI3FOU6iQg01Mtk5U0gwt84mjJnQotprXyHmwvJpmxC4LueRy9BpF5WaIQUv8xy4uzpHz4EeZuEHLk72u9f@OPXEzV10SZukXZ/nuUD/SEy/t5pt9hrp1IqfEcRVD2CKLe0CMlswSnPWOpSZOaFMEud0ngkEC6siofb35vS4I0jUEryt8gtDJtr/bL9Q7fpq/4htbq2K3BJkzIFcGjaSEcalnO0Ae0UdDYw7pprDLQNIkosJznTw) with 10000 elements and n=1000. – Kelly Bundy Dec 21 '22 at 15:23
  • 2
    @KellyBundy You could create an issue or pull request on https://github.com/more-itertools/more-itertools – Stef Dec 21 '22 at 15:27
1

As per Stef's comments, there are some other alternative answers. Neither are significantly different from Silvio's answer, but I thought to include them here for completeness:

Stef's code:

from itertools import islice
from collections import deque

def groupwise(iterable, n):
    iterator = iter(iterable); 
    accum = deque(islice(iterator, n-1), maxlen=n); 
    
    for x in iterator: 
        accum.append(x); 
        yield tuple(accum)


Kelly Bundy's Code:

from itertools import islice

def groupwise(iterable, n):
  it = iter(iterable)
  accum = None, *islice(it, n-1)
  for x in zip(it):
    accum = accum[1:] + x
    yield accum


rici's Code:

from itertools import islice
from collections import deque

def groupwise(iterable, n):     
    it = iter(iterable)     
    if n > 0 and len(accum := deque(islice(it, n-1), maxlen=n)) == n - 1:         
        return (accum.append(v) or tuple(accum) for v in it)      
    else:         
        return ()


rici v2 Code:

from itertools import islice
from collections import deque

def groupwise(iterable, n): 
    it = iter(iterable) 
    if n < 1: 
        return () 
    accum = deque(islice(it, n-1), maxlen = n) 
    return (accum.append(v) or tuple(accum) for v in it)


Using Windowed:

As per Stef's comment, there is a function within more-itertools called windowed that provides the required functionality:

from more_itertools import windowed

my_string = 'abcdef'

for item in windowed(my_string, 3):
    print("".join(item))


ScottC's version adapted from Windowed:

Borrowing a portion of the windowed code, and taking advice from Kelly Bundy, Stef and rici

from itertools import islice
from collections import deque

def groupwise(iterable, n): 
    if n < 1:
        return None

    it = iter(iterable) 
    accum = deque(islice(it, n-1), maxlen=n)
        
    for _ in map(accum.append, it):
        yield tuple(accum)

print(list(groupwise('abcde',3)))

Output:

[('a', 'b', 'c'), ('b', 'c', 'd'), ('c', 'd', 'e')]
ScottC
  • 3,941
  • 1
  • 6
  • 20
  • Since you're working with iterables, there is no guarantee that they are indexable. That is, `iterable[0:n-1]` might crash. I recommend using `itertools.islice` instead. – Stef Dec 21 '22 at 15:17
  • Updated - I think I covered it all now ? – ScottC Dec 21 '22 at 15:29
  • 1
    @Stef Likely the `len` call would crash instead. – Kelly Bundy Dec 21 '22 at 15:31
  • @Kelly: You can use `None` as the stop parameter to `itertools.islice`, which will have precisely the desired effect. What you can't do is stop `k` items before the end of the iterable, for values of `k` other than zero. – rici Dec 21 '22 at 16:00
  • @rici Why are you saying that? – Kelly Bundy Dec 21 '22 at 16:03
  • @kelly: Because I was looking at the call to `len` as an argument to `islice`, which is unnecessary. Evidently, you can get rid of the use of `len` at the beginning, and then check `len(accum)` after you create it. – rici Dec 21 '22 at 16:04
  • Thanks rici - have updated, and the `len` calls have been removed successfully due to your input. When `n > len(iterable)`, the `groupwise` function returns `None`. – ScottC Dec 21 '22 at 16:12
  • @rici Yes, I meant the `len(iterable)` in the answer's code. Responding to Stef's comment about non-indexable iterables, which usually also don't support `len`, and that `len` call came before the now-deleted slicing that Stef talked about, so the `len` call would be the crash reason instead. Yes, length can be checked afterwards, although I don't see how using `None` as `stop` would work. And the length check is pointless anyway, if you initialize with n-1 instead of n elements (using n-1 and removing an extra yield was my advice that the answer mentions). – Kelly Bundy Dec 21 '22 at 16:15
  • @ScottC Now the last one is broken, for example `list(groupwise(iter('abcde'), 3))` gives `[('a', 'b', 'e')]`. – Kelly Bundy Dec 21 '22 at 16:18
  • Try: `print(list(groupwise('abcde',3)))` – ScottC Dec 21 '22 at 16:20
  • @Kelly: I apologise for posting code in a comment but as this question is closed, I can't do anything else. Here's my suggestion: `def groupwise(iterable, n): it = iter(iterable) if n > 0 and len(accum := deque(islice(it, n-1), maxlen=n)) == n - 1: return (accum.append(v) or tuple(accum) for v in it) else: return () ` – rici Dec 21 '22 at 16:21
  • @ScottC Why should I try that? – Kelly Bundy Dec 21 '22 at 16:22
  • That avoids the second call to islice altogether. But my original thought was to replace the *second* call with `islice(iterable, n-1, None)`. Not the first call. – rici Dec 21 '22 at 16:22
  • @rici Ah, I hadn't even noticed that second `islice` before. Partly because my phone screen is too narrow to show the whole solution, partly because I never fully looked at that solution, and partly because I wouldn't have expected that second `islice` because it's just wrong (loses elements if the iterable is an iterator). In my opinion, "Stef's code" is the best when using deque (now that it initializes with n-1). It's closed as duplicate, btw, so if you have a good alternative, you can post under that other question :-) – Kelly Bundy Dec 21 '22 at 16:29
  • @rici Yes, and now if you change the genexp to an ordinary loop, making the function a generator function, you don't need the `or` hack. And I think it'd be more readable. – Kelly Bundy Dec 21 '22 at 16:39
  • ScottC:; To fix the problem @kelly points out, you need `it = iter(iterable)`, as in my code, and then you don't need the second `islice`. So that also avoids iterating the first n-1 items twice. – rici Dec 21 '22 at 16:46
  • @rici How could it be faster? It's doing the exact same stuff, except you're doing an additional truth check and (non-)jump. (Oh and you probably have a load_deref instead of a load_fast, but that's pretty much the same speed now, I think.) – Kelly Bundy Dec 21 '22 at 16:46
  • @kelly: I guess you're right. There's not much difference and it varies by the length of the iteration, but most of the time your way is a little faster, according to a quick unscientific benchmark. – rici Dec 21 '22 at 16:57
  • @rici [diff of generator expression vs generator function](https://www.diffchecker.com/QKK2eFfn) taken from [this](https://tio.run/##tY9NbsIwEIX3OcXbYVeAhFhQoXKYNJ7ASM7EdWwop08nNESA2LBgJG/ez3i@cE6HVtafIfY9N6GNCY67onBUYx/bHE7ckeFEsfz2NIfYLQrocMIOgz6ZdjRqCL6wGnPDREo5CsyYKKsqN9p29JN1d@e5Gr7Q5YuVnaMpfz2J@jLmr/VLb1mGQOLM0aKNSDl4@jcsahWOYNGzrBJwt9RnJgrVQmRJZraY4QObjX0L5ouIl/jN5dtp3QPupJ@ZvLsnf0bb938). As expected, they're *identical* except for you doing the additional/slower stuff I mentioned. (Granted that's Python 3.8, not sure about newer versions). – Kelly Bundy Dec 21 '22 at 17:03
  • @kelly: ok, I'll try to figure out why I'm seeing speed differences (albeit minor) when testing on v3.11. You're certainly right about the hackishness of `return (q.append(v) or tuple(q) for v in it)`. Unfortunately, there is no `appended(q, v)` which returns `q` after appending `v`; apparently wanting to do that is unpythonic. Guilty as charged, which is what lead me to that unfortunate idiom. But you could also return a generator using the possibly less hackish `return (tuple(q) for _ in map(q.append, it))`, which might be my last take on this distracting problem. – rici Dec 22 '22 at 06:33
  • is that for version 1 or version 2 - I can update it if you want ? – ScottC Dec 22 '22 at 06:38
  • @rici Have you tried diffing the `dis` results in 3.11? Maybe there are other differences now. Although I rather doubt it. There's no `appended`, but there are [`__iadd__` and `iconcat`](https://tio.run/##xZFBTsQwDEX3OYWXCSpIaCSEkDhLFVLPYCmNg@togMsXN4WyZkVWkb/9/ndSP/SVy@mxyrqehWcgRVHmvADNlUWBlkwJBxCsGNX1psQ5Y1LicrRN@NZwV7miRGU5CIlLslHnJjzDRbjVKy3oN6v4ko1dwhM4sEMKzz3CIYZejym12aTu4vdI1mKTt/dhgDm@Zyyml7BzBLVJsXr12upmsV075W4cKU7TOA7wSdUgIQTnqlBRf/MbTmK5oH8w@GnT/zv69xv@fMO@SvjDCuv6BQ). – Kelly Bundy Dec 22 '22 at 08:55