`string.split()` returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?

- [This question](http://stackoverflow.com/questions/3054604/) might be related. – Björn Pollex Oct 05 '10 at 08:51
- The reason is that it's very hard to think of a case where it's useful. Why do you want this? – Glenn Maynard Oct 05 '10 at 09:02
- @Glenn: Recently I saw a question about splitting a long string into chunks of n words. One of the solutions `split` the string and then returned a generator working on the result of `split`. That got me thinking if there was a way for `split` to return a generator to start with. – Manoj Govindan Oct 05 '10 at 09:07
- There is a relevant discussion on the Python issue tracker: http://bugs.python.org/issue17343 – saffsd Apr 19 '13 at 01:51
- @GlennMaynard it can be useful for really large bare string/file parsing, but anybody can write a generator parser themselves very easily using a self-brewed DFA and yield – Dmitry Ponyatov Dec 05 '18 at 06:50
17 Answers
It is highly probable that `re.finditer` uses fairly minimal memory overhead.

import re

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))
Demo:
>>> list( split_iter("A programmer's RegEx test.") )
['A', "programmer's", 'RegEx', 'test']
I have confirmed that this takes constant memory in Python 3.2.1, assuming my testing methodology was correct. I created a string of very large size (1GB or so), then iterated through the iterable with a `for` loop (NOT a list comprehension, which would have generated extra memory). This did not result in a noticeable growth of memory (that is, if there was a growth in memory, it was far far less than the 1GB string).
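One way to reproduce such a check (a sketch of my own; the answer does not spell out its exact methodology) is the standard library's `tracemalloc`:

import re
import tracemalloc

def split_iter(string):  # the same generator as above
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

big = "word " * 10_000_000            # roughly 50 MB of text
tracemalloc.start()                   # only allocations made from here on are traced
for token in split_iter(big):
    pass                              # consume the generator without storing results
current, peak = tracemalloc.get_traced_memory()
print(f"peak traced memory: {peak / 1e6:.1f} MB")  # tiny compared to len(big)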
More general version:
In reply to a comment "I fail to see the connection with `str.split`", here is a more general version:
def splitStr(string, sep=r"\s+"):
    # warning: does not yet work if sep is a lookahead like `(?=b)`
    if sep == '':
        return (c for c in string)
    else:
        return (_.group(1) for _ in re.finditer(f'(?:^|{sep})((?:(?!{sep}).)*)', string))
        # alternatively, more verbosely (commented out so the generator
        # expression above is what actually runs):
        #     regex = f'(?:^|{sep})((?:(?!{sep}).)*)'
        #     for match in re.finditer(regex, string):
        #         fragment = match.group(1)
        #         yield fragment
The idea is that `((?!pat).)*` 'negates' a group by ensuring it greedily matches until the pattern would start to match (lookaheads do not consume the string in the regex finite-state machine). In pseudocode: repeatedly consume (begin-of-string xor `{sep}`) + as much as possible until we would be able to begin again (or hit end of string).
Demo:
>>> splitStr('.......A...b...c....', sep='...')
<generator object splitStr.<locals>.<genexpr> at 0x7fe8530fb5e8>
>>> list(splitStr('A,b,c.', sep=','))
['A', 'b', 'c.']
>>> list(splitStr(',,A,b,c.,', sep=','))
['', '', 'A', 'b', 'c.', '']
>>> list(splitStr('.......A...b...c....', r'\.\.\.'))
['', '', '.A', 'b', 'c', '.']
>>> list(splitStr(' A b c. '))
['', 'A', 'b', 'c.', '']
(One should note that `str.split` has an ugly behavior: it special-cases `sep=None` as first doing `str.strip` to remove leading and trailing whitespace. The above purposefully does not do that; see the last example where `sep="\s+"`.)
(I ran into various bugs (including an internal re.error) when trying to implement this... Negative lookbehind will restrict you to fixed-length delimiters, so we don't use that. Almost anything besides the above regex seemed to result in errors with the beginning-of-string and end-of-string edge-cases (e.g. `r'(.*?)($|,)'` on `',,,a,,b,c'` returns `['', '', '', 'a', '', 'b', 'c', '']` with an extraneous empty string at the end; one can look at the edit history for another seemingly-correct regex that actually has subtle bugs).)
(If you want to implement this yourself for higher performance (although regexes are heavyweight, they most importantly run in C), you'd write some code (with ctypes? not sure how to get generators working with it?), with the following pseudocode for fixed-length delimiters: hash your delimiter of length L. Keep a running hash of length L as you scan the string, using a rolling-hash algorithm with O(1) update time. Whenever the hash might equal your delimiter, manually check whether the past few characters were the delimiter; if so, yield the substring since the last yield. Special-case the beginning and end of the string. This would be a generator version of the textbook algorithm for O(N) text search. Multiprocessing versions are also possible. They might seem overkill, but the question implies that one is working with really huge strings... At that point you might consider crazy things like caching byte offsets if there are few of them, or working from disk with some disk-backed bytestring view object, buying more RAM, etc. etc.)
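To make that pseudocode concrete, here is a rough pure-Python sketch of the rolling-hash idea. The function name and hash constants are my own; being pure Python it will be far slower than `re`, so treat it as an illustration of the O(1)-update hash and the verify-on-hit step, not as production code:

def rolling_hash_split(s, sep):
    """Yield fragments of s split on sep, via a Rabin-Karp rolling hash.
    Sketch only: one fixed-length delimiter, no maxsplit, str input."""
    L = len(sep)
    B, M = 256, (1 << 61) - 1            # hash base and a large prime modulus
    target = 0
    for ch in sep:
        target = (target * B + ord(ch)) % M
    BL = pow(B, L, M)                    # factor for evicting the oldest char
    h = filled = last = 0                # rolling hash, window size, fragment start
    for i, ch in enumerate(s):
        h = (h * B + ord(ch)) % M        # O(1) update: push the new char
        filled += 1
        if filled > L:                   # window too big: evict s[i - L]
            h = (h - ord(s[i - L]) * BL) % M
            filled = L
        if filled == L and h == target and s[i - L + 1:i + 1] == sep:
            yield s[last:i - L + 1]      # hash hit confirmed by direct compare
            last = i + 1
            h = filled = 0               # restart the window after the delimiter
    yield s[last:]                       # trailing fragment (possibly empty)

assert list(rolling_hash_split("ab<>cd<><>ef", "<>")) == "ab<>cd<><>ef".split("<>")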

- Excellent! I had forgotten about finditer. If one were interested in doing something like splitlines, I would suggest using this RE: `'(.*\n|.+$)'`. `str.splitlines` chops off the trailing newline though (something that I don't really like...); if you wanted to replicate that part of the behavior, you could use grouping: `(m.group(2) or m.group(3) for m in re.finditer('((.*)\n|(.+)$)', s))`. PS: I guess the outer parens in the RE are not needed; I just feel uneasy about using | without parens :P – allyourcode Feb 12 '15 at 17:54
- What about performance? re matching should be slower than ordinary search. – anatoly techtonik Aug 10 '16 at 04:36
- How would you rewrite this split_iter function to work like `a_string.split("delimiter")`? – Moberg Nov 14 '16 at 12:46
- split accepts regular expressions anyway so it's not really faster; if you want to use the returned value in a prev/next fashion, look at my answer at the bottom... – Veltzer Doron Dec 18 '17 at 14:35
- `str.split()` does not accept regular expressions, that's `re.split()` you're thinking of... – alexis Mar 31 '18 at 13:43
- If using a bold all-caps disclaimer that this "doesn't present an advantage in terms of memory", it would be nice to cite proof this method is O(N) memory, and not in fact the O(1) or O(log(N)) memory which I specifically tested for. – ninjagecko Feb 14 '19 at 09:41
- @allyourcode: `splitlines` does not chop ends of lines any more in Python 3 when you call it with `keepends=True`. I guess you already noticed, but I added this for bystanders seeing your old comment above. – kriss Oct 27 '20 at 10:58
- This is a nice solution but it doesn't actually have `sep` nor `maxsplit` arguments so I fail to see the connection with `str.split`... – Tomerikoo Jan 07 '21 at 19:31
- @Tomerikoo: I added a more general version to address how you did not see the connection to `str.split()` in its general form. It was non-trivial, so thanks for pointing that out. – ninjagecko Jan 09 '21 at 07:31
- Wow, I didn't mean to send you off to work. That's an impressive edit. I'm sorry if my wording was a bit extreme; I really just meant that I would expect to see `sep` and `maxsplit` somewhere in there, as the question is generally asking for a generator `split`, not a specific word-split. – Tomerikoo Jan 09 '21 at 09:51
The most efficient way I can think of is to write one using the offset parameter of the `str.find()` method. This avoids lots of memory use, and avoids relying on the overhead of a regexp when it's not needed.

[edit 2016-8-2: updated this to optionally support regex separators]

import re
def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()

    :param source:
        source string (unicode or bytes)

    :param sep:
        separator to split on.

    :param regex:
        if True, will treat sep as regular expression.

    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior
        source = source.strip()
        sep = "\\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize
This can be used like you want...

>>> list(isplit("abcb", "b"))
['a', 'c', '']
While there is a little bit of cost to seeking within the string each time find() or slicing is performed, this should be minimal since strings are represented as contiguous arrays in memory.

Did some performance testing on the various methods proposed (I won't repeat them here). Some results:

- `str.split` (default) = 0.3461570239996945
- manual search (by character) (one of Dave Webb's answers) = 0.8260340550004912
- `re.finditer` (ninjagecko's answer) = 0.698872097000276
- `str.find` (one of Eli Collins's answers) = 0.7230395330007013
- `itertools.takewhile` (Ignacio Vazquez-Abrams's answer) = 2.023023967998597
- `str.split(..., maxsplit=1)` recursion = N/A†

†The recursion answers (`string.split` with `maxsplit=1`) fail to complete in a reasonable time. Given `string.split`'s speed, they may work better on shorter strings, but then I can't see the use-case for short strings where memory isn't an issue anyway.
Tested using `timeit` on:

the_text = "100 " * 9999 + "100"

def test_function(method):
    def fn():
        total = 0
        for x in method(the_text):
            total += int(x)
        return total
    return fn
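The answer does not show the `timeit` invocation itself; a driver along these lines would reproduce the setup (the `number` value and the wrapper name here are my own guesses):

import timeit

def split_builtin(text):  # hypothetical baseline wrapper around str.split
    return text.split()

print(timeit.timeit(test_function(split_builtin), number=100))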
This raises another question as to why `string.split` is so much faster despite its memory usage.

- This is because memory is slower than CPU and, in this case, the list is loaded by chunks whereas all the others are loaded element by element. On the same note, many academics will tell you linked lists are faster and have less complexity while your computer will often be faster with arrays, which it finds easier to optimise. **You can't assume an option is faster than another, test it!** +1 for testing. – Benoît P Feb 12 '19 at 14:54
- The problem arises in the next steps of a processing chain. If you then want to find a specific chunk and ignore the rest when you find it, then you have the justification to use a generator-based split instead of the built-in solution. – jgomo3 Feb 17 '20 at 15:09
This is a generator version of `split()` implemented via `re.search()` that does not have the problem of allocating too many substrings.
import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()


sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["

assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')
EDIT: Corrected handling of surrounding whitespace if no separator chars are given.

- @ErikKaplun Because the regex logic for the items can be more complex than for their separators. In my case, I wanted to process each line individually, so I can report back if a line failed to match. – rovyko Apr 30 '20 at 19:14
Here is my implementation, which is much, much faster and more complete than the other answers here. It has 4 separate subfunctions for different cases. I'll just copy the docstring of the main `str_split` function:
str_split(s, *delims, empty=None)

Split the string `s` by the rest of the arguments, possibly omitting
empty parts (`empty` keyword argument is responsible for that).
This is a generator function.

When only one delimiter is supplied, the string is simply split by it.
`empty` is then `True` by default.

    str_split('[]aaa[][]bb[c', '[]')
        -> '', 'aaa', '', 'bb[c'
    str_split('[]aaa[][]bb[c', '[]', empty=False)
        -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest
possible sequences of those delimiters by default, or, if `empty` is set to
`True`, empty strings between the delimiters are also included. Note that
the delimiters in this case may only be single characters.

    str_split('aaa, bb : c;', ' ', ',', ':', ';')
        -> 'aaa', 'bb', 'c'
    str_split('aaa, bb : c;', *' ,:;', empty=True)
        -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, `string.whitespace` is used, so the effect
is the same as `str.split()`, except this function is a generator.

    str_split('aaa\\t bb c \\n')
        -> 'aaa', 'bb', 'c'
import string

def _str_split_chars(s, delims):
    "Split the string `s` by characters contained in `delims`, including the \
empty parts between two consecutive delimiters"
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i+1
    yield s[start:]

def _str_split_chars_ne(s, delims):
    "Split the string `s` by longest possible sequences of characters \
contained in `delims`"
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]

def _str_split_word(s, delim):
    "Split the string `s` by the string `delim`"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    yield s[start:]

def _str_split_word_ne(s, delim):
    "Split the string `s` by the string `delim`, not including empty parts \
between two consecutive delimiters"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start!=i:
                yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    if start<len(s):
        yield s[start:]

def str_split(s, *delims, empty=None):
    """\
    Split the string `s` by the rest of the arguments, possibly omitting
    empty parts (`empty` keyword argument is responsible for that).
    This is a generator function.

    When only one delimiter is supplied, the string is simply split by it.
    `empty` is then `True` by default.
        str_split('[]aaa[][]bb[c', '[]')
            -> '', 'aaa', '', 'bb[c'
        str_split('[]aaa[][]bb[c', '[]', empty=False)
            -> 'aaa', 'bb[c'

    When multiple delimiters are supplied, the string is split by longest
    possible sequences of those delimiters by default, or, if `empty` is set to
    `True`, empty strings between the delimiters are also included. Note that
    the delimiters in this case may only be single characters.
        str_split('aaa, bb : c;', ' ', ',', ':', ';')
            -> 'aaa', 'bb', 'c'
        str_split('aaa, bb : c;', *' ,:;', empty=True)
            -> 'aaa', '', 'bb', '', '', 'c', ''

    When no delimiters are supplied, `string.whitespace` is used, so the effect
    is the same as `str.split()`, except this function is a generator.
        str_split('aaa\\t bb c \\n')
            -> 'aaa', 'bb', 'c'
    """
    if len(delims)==1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
    if len(delims)==0:
        delims = string.whitespace
    delims = set(delims) if len(delims)>=4 else ''.join(delims)
    if any(len(d)>1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)
This function works in Python 3; an easy, though quite ugly, fix can be applied to make it work in both versions 2 and 3. The first lines of the function should be changed to:
def str_split(s, *delims, **kwargs):
    """...docstring..."""
    empty = kwargs.get('empty')
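Either way, a quick sanity check against the docstring examples:

>>> assert list(str_split('[]aaa[][]bb[c', '[]')) == ['', 'aaa', '', 'bb[c']
>>> assert list(str_split('aaa, bb : c;', ' ', ',', ':', ';')) == ['aaa', 'bb', 'c']
>>> assert list(str_split('aaa\t bb c \n')) == ['aaa', 'bb', 'c']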

I wrote a version of @ninjagecko's answer that behaves more like string.split (i.e. whitespace delimited by default and you can specify a delimiter).
import re

def isplit(string, delimiter=None):
    """Like string.split but returns an iterator (lazy)

    Multiple character delimiters are not handled.
    """
    if delimiter is None:
        # Whitespace delimited by default
        delim = r"\s"
    elif len(delimiter) != 1:
        raise ValueError("Can only handle single character delimiters",
                         delimiter)
    else:
        # Escape, in case it's "\", "*" etc.
        delim = re.escape(delimiter)
    return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))
Here are the tests I used (in both Python 3 and Python 2):

# Wrapper to make it a list
def helper(*args, **kwargs):
    return list(isplit(*args, **kwargs))

# Normal delimiters
assert helper("1,2,3", ",") == ["1", "2", "3"]
assert helper("1;2;3,", ";") == ["1", "2", "3,"]
assert helper("1;2 ;3, ", ";") == ["1", "2 ", "3, "]

# Whitespace
assert helper("1 2 3") == ["1", "2", "3"]
assert helper("1\t2\t3") == ["1", "2", "3"]
assert helper("1\t2 \t3") == ["1", "2", "3"]
assert helper("1\n2\n3") == ["1", "2", "3"]

# Surrounding whitespace dropped
assert helper(" 1 2 3 ") == ["1", "2", "3"]

# Regex special characters
assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
assert helper(r"1*2*3", "*") == ["1", "2", "3"]

# No multi-char delimiters allowed
try:
    helper(r"1,.2,.3", ",.")
    assert False
except ValueError:
    pass
Python's regex module says that it does "the right thing" for unicode whitespace, but I haven't actually tested it.
Also available as a gist.

If you would also like to be able to read an iterator (as well as return one), try this:

import itertools as it

def iter_split(string, sep=None):
    sep = sep or ' '
    groups = it.groupby(string, lambda s: s != sep)
    return (''.join(g) for k, g in groups if k)
Usage
>>> list(iter_split(iter("Good evening, world!")))
['Good', 'evening,', 'world!']
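Because `groupby` consumes any iterable of characters, the input does not have to be a fully materialized string. For example (a sketch of mine with simulated chunks):

import itertools as it

# chunks as they might arrive from a file or socket (simulated here)
chunks = iter(["Good eve", "ning, wo", "rld!"])
stream = it.chain.from_iterable(chunks)
print(list(iter_split(stream)))   # ['Good', 'evening,', 'world!']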

No, but it should be easy enough to write one using `itertools.takewhile()`.
EDIT:
Very simple, half-broken implementation:
import itertools
import string

def isplitwords(s):
    i = iter(s)
    while True:
        r = []
        for c in itertools.takewhile(lambda x: x not in string.whitespace, i):
            r.append(c)
        if r:
            yield ''.join(r)
        else:
            # raising StopIteration inside a generator is a RuntimeError
            # since Python 3.7 (PEP 479); a plain return ends the generator
            return
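A quick check (my addition) shows both where it works and where the "half-broken" part bites: consecutive whitespace still ends the generator early.

>>> list(isplitwords("Good evening, world!"))
['Good', 'evening,', 'world!']
>>> list(isplitwords("a  b"))  # stops at the double space
['a']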

- @Ignacio: The example in docs uses a list of integers to illustrate the use of `takeWhile`. What would be a good `predicate` for splitting a string into words (default `split`) using `takeWhile()`? – Manoj Govindan Oct 05 '10 at 08:36
- The separator can have multiple characters, `'abc ghi<><>lmn'.split('<>') == ['abc`… – kennytm Oct 05 '10 at 08:42
- Easy to write, but *many* orders of magnitude slower. This is an operation that really should be implemented in native code. – Glenn Maynard Oct 05 '10 at 08:43
- @KennyTM: Sure, it *can* be. But it doesn't always need to be, and it usually is not. – Ignacio Vazquez-Abrams Oct 05 '10 at 08:44
- @Glenn: Is the string type's `split` implemented in native code? I checked `string.split` and found it dispatches to `s.split` where `s` is the first argument to `string.split`. – Manoj Govindan Oct 05 '10 at 09:09
- @Manoj: `str` and `unicode` are implemented in native code, so yes. – Ignacio Vazquez-Abrams Oct 05 '10 at 09:13
- @Ignacio: Got it. Is a native generator version possible at all? – Manoj Govindan Oct 05 '10 at 09:27
- Probably. You may need to implement a new type for the generator and fill its `tp_iternext` member, but I don't know all the details. – Ignacio Vazquez-Abrams Oct 05 '10 at 09:33
- It's a lot more work, and I doubt the value of this to begin with, but anything you can do in Python you can do natively if you really want to. – Glenn Maynard Oct 05 '10 at 09:52
I don't see any obvious benefit to a generator version of `split()`. The generator object is going to have to contain the whole string to iterate over, so you're not going to save any memory by having a generator.

If you wanted to write one it would be fairly easy though:
import string

def gsplit(s, sep=string.whitespace):
    word = []
    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)
    if word:
        yield "".join(word)

- You'd halve the memory used by not having to store a second copy of the string in each resulting part, plus the array and object overhead (which is typically more than the strings themselves). That generally doesn't matter, though (if you're splitting strings so large that this matters, you're probably doing something wrong), and even a native C generator implementation would always be significantly slower than doing it all at once. – Glenn Maynard Oct 05 '10 at 08:58
- @Glenn Maynard - I just realised that. For some reason I originally thought the generator would store a copy of the string rather than a reference. A quick check with `id()` put me right. And obviously as strings are immutable you don't need to worry about someone changing the original string while you're iterating over it. – David Webb Oct 05 '10 at 09:02
- Isn't the main point in using a generator not the memory usage, but that you could save yourself having to split the whole string if you wanted to exit early? (That's not a comment on your particular solution, I was just surprised by the discussion about memory). – Scott Griffiths Oct 05 '10 at 16:15
- @Scott: It's hard to think of a case where that's really a win--where 1: you want to stop splitting partway through, 2: you don't know how many words you're splitting in advance, 3: you have a large enough string for it to matter, and 4: you consistently stop early enough for it to be a significant win over str.split. That's a very narrow set of conditions. – Glenn Maynard Oct 05 '10 at 20:35
- You can have a much higher benefit if your string is lazily generated as well (e.g. from network traffic or file reads) – Lie Ryan Feb 22 '11 at 10:53
`more_itertools.split_at` offers an analog to `str.split` for iterators.

>>> import more_itertools as mit
>>> list(mit.split_at("abcdcba", lambda x: x == "b"))
[['a'], ['c', 'd', 'c'], ['a']]
>>> "abcdcba".split("b")
['a', 'cdc', 'a']

`more_itertools` is a third-party package.
- Note that `more_itertools.split_at()` is still using a newly allocated list on each call, so while this does return an iterator, it is not achieving the constant memory requirement. So depending on why you wanted an iterator to begin with, this may or may not be helpful. – jcater Aug 06 '19 at 13:13
- @jcater Good point. The intermediate values are indeed buffered as sub-lists within the iterator, according to its [implementation](https://more-itertools.readthedocs.io/en/latest/_modules/more_itertools/more.html#split_at). One could adapt the source to substitute lists with iterators, append with `itertools.chain` and evaluate results using a list comprehension. Depending on the need and request, I can post an example. – pylang Aug 06 '19 at 17:06
I wanted to show how to use the `finditer` solution to return a generator for given delimiters, and then use the `pairwise` recipe from itertools to build a previous/next iteration which will get the actual words as in the original split method.
import re
from more_itertools import pairwise

string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
delimiter = " "

# split according to the given delimiter, including segments that begin at the
# start of the string and end at its end
for prev, curr in pairwise(re.finditer("^|[{0}]+|$".format(delimiter), string)):
    print(string[prev.end(): curr.start()])
Notes:

- I use prev & curr instead of prev & next because overriding `next` in Python is a very bad idea
- This is quite efficient

- 934
- 2
- 10
- 31
Dumbest method, without regex / itertools:

def isplit(text, split='\n'):
    while text != '':
        end = text.find(split)
        if end == -1:
            yield text
            text = ''
        else:
            yield text[:end]
            text = text[end + len(split):]  # skip the whole separator, not just one char
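A couple of quick checks (my addition); since the step above advances by `len(split)`, multi-character separators work too:

>>> list(isplit("a\nb\nc"))
['a', 'b', 'c']
>>> list(isplit("ab<>cd<><>ef", "<>"))
['ab', 'cd', '', 'ef']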

Very old question, but here is my humble contribution with an efficient algorithm:
from typing import Iterable

def str_split(text: str, separator: str) -> Iterable[str]:
    i = 0
    n = len(text)
    while i <= n:
        j = text.find(separator, i)
        if j == -1:
            j = n
        yield text[i:j]
        i = j + len(separator)  # step past the full separator (j + 1 only works for single characters)
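A quick comparison (my addition) with `str.split` given an explicit separator:

>>> list(str_split("a,b,,c,", ","))
['a', 'b', '', 'c', '']
>>> "a,b,,c,".split(",")
['a', 'b', '', 'c', '']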

def split_generator(f, s):
    """
    f is a string, s is the (single-character) separator we split on.
    This produces a generator rather than a possibly
    memory intensive list.
    """
    i = 0
    j = 0
    while j < len(f):
        if i >= len(f):
            yield f[j:]
            j = i
        elif f[i] != s:
            i = i + 1
        else:
            yield f[j:i]  # yield the fragment itself, not a one-element list
            j = i + 1
            i = i + 1

Here is a simple response:
def gen_str(some_string, sep):
    j = 0
    guard = len(some_string) - 1
    for i, s in enumerate(some_string):
        if s == sep:
            yield some_string[j:i]
            j = i + 1
        elif i != guard:
            continue
        else:
            yield some_string[j:]

import re

def isplit(text, sep=None, maxsplit=-1):
    if not isinstance(text, (str, bytes)):
        raise TypeError(f"requires 'str' or 'bytes' but received a '{type(text).__name__}'")
    if sep in ('', b''):
        raise ValueError('empty separator')
    if maxsplit == 0 or not text:
        yield text
        return
    regex = (
        re.escape(sep) if sep is not None
        else [br'\s+', r'\s+'][isinstance(text, str)]
    )
    # note: re.split materializes the full list internally,
    # so only the iteration over the result is lazy here
    yield from re.split(regex, text, maxsplit=max(0, maxsplit))

Here is an answer that is based on split and maxsplit. This does not use recursion.
def gsplit(todo):
    chunk = 100
    while todo:
        splits = todo.split(maxsplit=chunk)
        if len(splits) == chunk + 1:  # a still-unsplit remainder is present
            todo = splits.pop()       # re-split it on the next pass
        else:
            todo = None
        for item in splits:
            yield item
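A quick test (my addition) on input longer than one chunk:

>>> text = " ".join(str(i) for i in range(1000))
>>> list(gsplit(text)) == text.split()
True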
