There's a nice example in the Python 2.6 itertools
docs that shows how to find consecutive sequences. To quote:
Find runs of consecutive numbers using groupby
. The key to the
solution is differencing with a range so that consecutive numbers all
appear in same group.
For some strange reason, that example is not in the later versions of the docs. That code works for sequences of numbers, the code below shows how to adapt it to work on letters.
from itertools import groupby
s = 'jaghiuuabc'
def keyfunc(t):
''' Subtract the character's index in the string
from its Unicode codepoint number.
'''
i, c = t
return ord(c) - i
a = []
for k, g in groupby(enumerate(s), key=keyfunc):
# Extract the chars from the (index, char) tuples in the group
seq = [t[1] for t in g]
if len(seq) > 1:
a.append(''.join(seq))
print(a)
output
['ghi', 'abc']
How it works
The heart of this code is
groupby(enumerate(s), key=keyfunc)
enumerate(s)
generates tuples containing the index number and character for each character in s
. For example:
s = 'ABCEF'
for t in enumerate(s):
print(t)
output
(0, 'A')
(1, 'B')
(2, 'C')
(3, 'E')
(4, 'F')
groupby
takes items from a sequence or iterator and gathers adjacent equal items together into groups. By default, it simply compares the values of the items to see if they're equal. But you can also give it a key function. When you do that, it passes each item to the key function and uses the result returned by that key function for its equality test.
Here's a simple example. First, we define a function div_by_10
that divides a number by 10, using integer division. This basically gets rid of the last digit in the number.
def div_by_10(n):
return n // 10
a = [2, 5, 10, 13, 17, 21, 22, 29, 33, 35]
b = [div_by_10(u) for u in a]
print(a)
print(b)
output
[2, 5, 10, 13, 17, 21, 22, 29, 33, 35]
[0, 0, 1, 1, 1, 2, 2, 2, 3, 3]
So if we use div_by_10
as the key function to groupby
it will ignore the last digit in each number and thus it will group adjacent numbers together if they only differ in the last digit.
from itertools import groupby
def div_by_10(n):
return n // 10
a = [2, 5, 10, 13, 17, 21, 22, 29, 33, 35]
print(a)
for key, group in groupby(a, key=div_by_10):
print(key, list(group))
output
[2, 5, 10, 13, 17, 21, 22, 29, 33, 35]
0 [2, 5]
1 [10, 13, 17]
2 [21, 22, 29]
3 [33, 35]
My keyfunc
receives a (index_number, character) tuple and subtracts that index_number from the character's code number and returns the result. Let's see what that does with my earlier example of 'ABCEF'
:
def keyfunc(t):
i, c = t
return ord(c) - i
for t in enumerate('ABCEF'):
print(t, keyfunc(t))
output
(0, 'A') 65
(1, 'B') 65
(2, 'C') 65
(3, 'E') 66
(4, 'F') 66
The code number for 'A' is 65, the code number for 'B' is 66, the code number for 'C' is 67, etc. So when we subtract the index from the code number for each of 'A', 'B', and 'C' we get 65. But we skipped over 'D' so when we do the subtractions for 'E' and 'F' we get 66. And that's how groupby
can put 'A', 'B', & 'C' in one group and 'E' & 'F' in the next group.
This can be tricky stuff. Don't expect to understand it all completely straight away. But if you do some experiments yourself I'm sure it will gradually sink in. ;)
Just for fun, here's the unreadable multiply-nested list comprehension version of that code. ;)
print([z for _, g in groupby(enumerate(s),lambda t:ord(t[1])-t[0])for z in[''.join([*zip(*g)][1])]if len(z)>1])
Here's another version which was inspired by Amit Tripathi's answer. This one doesn't use any imports because it does the grouping manually. prev
contains the codepoint number of the previous character. We initialize prev
to -2 so that the first time the if i != prev + 1
test is performed it's guaranteed to be true because the smallest possible value of ord(ch)
is zero, so a new empty list will be added to groups
.
s = 'jaghiuuabcxyzq'
prev, groups = -2, []
for ch in s:
i = ord(ch)
if i != prev + 1:
groups.append([])
groups[-1].append(ch)
prev = i
print(groups)
a = [''.join(u) for u in groups if len(u) > 1]
print(a)
output
[['j'], ['a'], ['g', 'h', 'i'], ['u'], ['u'], ['a', 'b', 'c'], ['x', 'y', 'z'], ['q']]
['ghi', 'abc', 'xyz']