re can be used to capture globally as well:
>>> s = 'The 7 quick brown foxes jumped 7 times over 7 lazy dogs'
>>> sep = '7'
>>>
>>> [i for i in re.split(f'({sep}[^{sep}]*)', s) if i]
['The ', '7 quick brown foxes jumped ', '7 times over ', '7 lazy dogs']
If the f-string is hard to read, note that it just evaluates to (7[^7]*)
.
(To the same end as the listcomp one can use list(filter(bool, ...))
, but it's comparatively quite ugly)
In Python 3.7 and onward, re.split()
allows splitting on zero-width patterns. This means a lookahead regex, namely f'(?={sep})'
, can be used instead of the group shown above.
What's strange about this is the timings: if using re.split()
(i.e. without a compiled pattern object), the group solution consistently runs about 1.5x faster than the lookahead. However, when compiled, the lookahead beats the other hands-down:
In [4]: r_lookahead = re.compile('f(?={sep})')
In [5]: r_group = re.compile(f'({sep}[^{sep}]*)')
In [6]: %timeit [i for i in r_lookahead.split(s) if i]
2.76 µs ± 207 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [7]: %timeit [i for i in r_group.split(s) if i]
5.74 µs ± 65.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [8]: %timeit [i for i in r_lookahead.split(s * 512) if i]
137 µs ± 1.93 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [9]: %timeit [i for i in r_group.split(s * 512) if i]
1.88 ms ± 18.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
A recursive solution also works fine, although more slowly than splitting on a compiled regex (but faster than a straight re.split(...)
):
def splitkeep(s, sep, prefix=''):
start, delim, end = s.partition(sep)
return [prefix + start, *(end and splitkeep(end, sep, delim))]
>>> s = 'The 7 quick brown foxes jumped 7 times over 7 lazy dogs'
>>>
>>> splitkeep(s, '7')
['The ', '7 quick brown foxes jumped ', '7 times over ', '7 lazy dogs']