`string.split()` returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?

- [This question](http://stackoverflow.com/questions/3054604/) might be related. – Björn Pollex Oct 05 '10 at 08:51
- The reason is that it's very hard to think of a case where it's useful. Why do you want this? – Glenn Maynard Oct 05 '10 at 09:02
- @Glenn: Recently I saw a question about splitting a long string into chunks of n words. One of the solutions `split` the string and then returned a generator working on the result of `split`. That got me thinking if there was a way for `split` to return a generator to start with. – Manoj Govindan Oct 05 '10 at 09:07
- There is a relevant discussion on the Python issue tracker: http://bugs.python.org/issue17343 – saffsd Apr 19 '13 at 01:51
- @GlennMaynard it can be useful for really large bare string/file parsing, but anybody can write a generator parser themselves very easily using a self-brewed DFA and yield – Dmitry Ponyatov Dec 05 '18 at 06:50
17 Answers
It is highly probable that `re.finditer` uses fairly minimal memory overhead.

import re

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))
Demo:
>>> list( split_iter("A programmer's RegEx test.") )
['A', "programmer's", 'RegEx', 'test']
I have confirmed that this takes constant memory in Python 3.2.1, assuming my testing methodology was correct. I created a string of very large size (1GB or so), then iterated through the iterable with a `for` loop (NOT a list comprehension, which would have generated extra memory). This did not result in a noticeable growth of memory (that is, if there was a growth in memory, it was far far less than the 1GB string).
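One way to reproduce such a check (a sketch of my own; the answer does not spell out its exact methodology) is the standard library's `tracemalloc`:

import re
import tracemalloc

def split_iter(string):  # the same generator as above
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

big = "word " * 10_000_000            # roughly 50 MB of text
tracemalloc.start()                   # only allocations made from here on are traced
for token in split_iter(big):
    pass                              # consume the generator without storing results
current, peak = tracemalloc.get_traced_memory()
print(f"peak traced memory: {peak / 1e6:.1f} MB")  # tiny compared to len(big)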
More general version:
In reply to a comment "I fail to see the connection with `str.split`", here is a more general version:
def splitStr(string, sep=r"\s+"):
    # warning: does not yet work if sep is a lookahead like `(?=b)`
    if sep == '':
        return (c for c in string)
    else:
        return (_.group(1) for _ in re.finditer(f'(?:^|{sep})((?:(?!{sep}).)*)', string))
        # alternatively, more verbosely (commented out so the generator
        # expression above is what actually runs):
        #     regex = f'(?:^|{sep})((?:(?!{sep}).)*)'
        #     for match in re.finditer(regex, string):
        #         fragment = match.group(1)
        #         yield fragment
The idea is that `((?!pat).)*` 'negates' a group by ensuring it greedily matches until the pattern would start to match (lookaheads do not consume the string in the regex finite-state machine). In pseudocode: repeatedly consume (begin-of-string xor `{sep}`) + as much as possible until we would be able to begin again (or hit end of string).
Demo:
>>> splitStr('.......A...b...c....', sep='...')
<generator object splitStr.<locals>.<genexpr> at 0x7fe8530fb5e8>
>>> list(splitStr('A,b,c.', sep=','))
['A', 'b', 'c.']
>>> list(splitStr(',,A,b,c.,', sep=','))
['', '', 'A', 'b', 'c.', '']
>>> list(splitStr('.......A...b...c....', r'\.\.\.'))
['', '', '.A', 'b', 'c', '.']
>>> list(splitStr(' A b c. '))
['', 'A', 'b', 'c.', '']
(One should note that `str.split` has an ugly behavior: it special-cases `sep=None` as first doing `str.strip` to remove leading and trailing whitespace. The above purposefully does not do that; see the last example where `sep="\s+"`.)
(I ran into various bugs (including an internal re.error) when trying to implement this... Negative lookbehind will restrict you to fixed-length delimiters, so we don't use that. Almost anything besides the above regex seemed to result in errors with the beginning-of-string and end-of-string edge-cases (e.g. `r'(.*?)($|,)'` on `',,,a,,b,c'` returns `['', '', '', 'a', '', 'b', 'c', '']` with an extraneous empty string at the end; one can look at the edit history for another seemingly-correct regex that actually has subtle bugs).)
(If you want to implement this yourself for higher performance (although regexes are heavyweight, they most importantly run in C), you'd write some code (with ctypes? not sure how to get generators working with it?), with the following pseudocode for fixed-length delimiters: hash your delimiter of length L. Keep a running hash of length L as you scan the string, using a rolling-hash algorithm with O(1) update time. Whenever the hash might equal your delimiter, manually check whether the past few characters were the delimiter; if so, yield the substring since the last yield. Special-case the beginning and end of the string. This would be a generator version of the textbook algorithm for O(N) text search. Multiprocessing versions are also possible. They might seem overkill, but the question implies that one is working with really huge strings... At that point you might consider crazy things like caching byte offsets if there are few of them, or working from disk with some disk-backed bytestring view object, buying more RAM, etc. etc.)
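To make that pseudocode concrete, here is a rough pure-Python sketch of the rolling-hash idea. The function name and hash constants are my own; being pure Python it will be far slower than `re`, so treat it as an illustration of the O(1)-update hash and the verify-on-hit step, not as production code:

def rolling_hash_split(s, sep):
    """Yield fragments of s split on sep, via a Rabin-Karp rolling hash.
    Sketch only: one fixed-length delimiter, no maxsplit, str input."""
    L = len(sep)
    B, M = 256, (1 << 61) - 1            # hash base and a large prime modulus
    target = 0
    for ch in sep:
        target = (target * B + ord(ch)) % M
    BL = pow(B, L, M)                    # factor for evicting the oldest char
    h = filled = last = 0                # rolling hash, window size, fragment start
    for i, ch in enumerate(s):
        h = (h * B + ord(ch)) % M        # O(1) update: push the new char
        filled += 1
        if filled > L:                   # window too big: evict s[i - L]
            h = (h - ord(s[i - L]) * BL) % M
            filled = L
        if filled == L and h == target and s[i - L + 1:i + 1] == sep:
            yield s[last:i - L + 1]      # hash hit confirmed by direct compare
            last = i + 1
            h = filled = 0               # restart the window after the delimiter
    yield s[last:]                       # trailing fragment (possibly empty)

assert list(rolling_hash_split("ab<>cd<><>ef", "<>")) == "ab<>cd<><>ef".split("<>")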

- Excellent! I had forgotten about finditer. If one were interested in doing something like splitlines, I would suggest using this RE: `'(.*\n|.+$)'`. `str.splitlines` chops off the trailing newline though (something that I don't really like...); if you wanted to replicate that part of the behavior, you could use grouping: `(m.group(2) or m.group(3) for m in re.finditer('((.*)\n|(.+)$)', s))`. PS: I guess the outer parens in the RE are not needed; I just feel uneasy about using | without parens :P – allyourcode Feb 12 '15 at 17:54
- What about performance? re matching should be slower than ordinary search. – anatoly techtonik Aug 10 '16 at 04:36
- How would you rewrite this split_iter function to work like `a_string.split("delimiter")`? – Moberg Nov 14 '16 at 12:46
- split accepts regular expressions anyway so it's not really faster; if you want to use the returned value in a prev/next fashion, look at my answer at the bottom... – Veltzer Doron Dec 18 '17 at 14:35
- `str.split()` does not accept regular expressions, that's `re.split()` you're thinking of... – alexis Mar 31 '18 at 13:43
- If using a bold all-caps disclaimer that this "doesn't present an advantage in terms of memory", it would be nice to cite proof this method is O(N) memory, and not in fact the O(1) or O(log(N)) memory which I specifically tested for. – ninjagecko Feb 14 '19 at 09:41
- @allyourcode: `splitlines` does not chop ends of lines any more in Python 3 when you call it with `keepends=True`. I guess you already noticed, but I added this for bystanders seeing your old comment above. – kriss Oct 27 '20 at 10:58
- This is a nice solution but it doesn't actually have `sep` nor `maxsplit` arguments so I fail to see the connection with `str.split`... – Tomerikoo Jan 07 '21 at 19:31
- @Tomerikoo: I added a more general version to address how you did not see the connection to `str.split()` in its general form. It was non-trivial, so thanks for pointing that out. – ninjagecko Jan 09 '21 at 07:31
- Wow, I didn't mean to send you off to work. That's an impressive edit. I'm sorry if my wording was a bit extreme; I really just meant that I would expect to see `sep` and `maxsplit` somewhere in there, as the question is generally asking for a generator `split`, not a specific word-split. – Tomerikoo Jan 09 '21 at 09:51
The most efficient way I can think of is to write one using the offset parameter of the `str.find()` method. This avoids lots of memory use, and avoids relying on the overhead of a regexp when it's not needed.

[edit 2016-8-2: updated this to optionally support regex separators]

import re
def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()

    :param source:
        source string (unicode or bytes)

    :param sep:
        separator to split on.

    :param regex:
        if True, will treat sep as regular expression.

    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior
        source = source.strip()
        sep = "\\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize
This can be used like you want...

>>> list(isplit("abcb", "b"))
['a', 'c', '']
While there is a little bit of cost to seeking within the string each time find() or slicing is performed, this should be minimal since strings are represented as contiguous arrays in memory.

Did some performance testing on the various methods proposed (I won't repeat them here). Some results:

- `str.split` (default) = 0.3461570239996945
- manual search (by character) (one of Dave Webb's answers) = 0.8260340550004912
- `re.finditer` (ninjagecko's answer) = 0.698872097000276
- `str.find` (one of Eli Collins's answers) = 0.7230395330007013
- `itertools.takewhile` (Ignacio Vazquez-Abrams's answer) = 2.023023967998597
- `str.split(..., maxsplit=1)` recursion = N/A†

†The recursion answers (`string.split` with `maxsplit=1`) fail to complete in a reasonable time. Given `string.split`'s speed, they may work better on shorter strings, but then I can't see the use-case for short strings where memory isn't an issue anyway.
Tested using `timeit` on:

the_text = "100 " * 9999 + "100"

def test_function(method):
    def fn():
        total = 0
        for x in method(the_text):
            total += int(x)
        return total
    return fn
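The answer does not show the `timeit` invocation itself; a driver along these lines would reproduce the setup (the `number` value and the wrapper name here are my own guesses):

import timeit

def split_builtin(text):  # hypothetical baseline wrapper around str.split
    return text.split()

print(timeit.timeit(test_function(split_builtin), number=100))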
This raises another question as to why `string.split` is so much faster despite its memory usage.

- This is because memory is slower than CPU and, in this case, the list is loaded by chunks whereas all the others are loaded element by element. On the same note, many academics will tell you linked lists are faster and have less complexity while your computer will often be faster with arrays, which it finds easier to optimise. **You can't assume an option is faster than another, test it!** +1 for testing. – Benoît P Feb 12 '19 at 14:54
- The problem arises in the next steps of a processing chain. If you then want to find a specific chunk and ignore the rest when you find it, then you have the justification to use a generator-based split instead of the built-in solution. – jgomo3 Feb 17 '20 at 15:09
This is a generator version of `split()` implemented via `re.search()` that does not have the problem of allocating too many substrings.
import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()


sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["

assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')
EDIT: Corrected handling of surrounding whitespace if no separator chars are given.

- @ErikKaplun Because the regex logic for the items can be more complex than for their separators. In my case, I wanted to process each line individually, so I can report back if a line failed to match. – rovyko Apr 30 '20 at 19:14
Here is my implementation, which is much, much faster and more complete than the other answers here. It has 4 separate subfunctions for different cases. I'll just copy the docstring of the main `str_split` function:
str_split(s, *delims, empty=None)

Split the string `s` by the rest of the arguments, possibly omitting
empty parts (`empty` keyword argument is responsible for that).
This is a generator function.

When only one delimiter is supplied, the string is simply split by it.
`empty` is then `True` by default.

    str_split('[]aaa[][]bb[c', '[]')
        -> '', 'aaa', '', 'bb[c'
    str_split('[]aaa[][]bb[c', '[]', empty=False)
        -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest
possible sequences of those delimiters by default, or, if `empty` is set to
`True`, empty strings between the delimiters are also included. Note that
the delimiters in this case may only be single characters.

    str_split('aaa, bb : c;', ' ', ',', ':', ';')
        -> 'aaa', 'bb', 'c'
    str_split('aaa, bb : c;', *' ,:;', empty=True)
        -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, `string.whitespace` is used, so the effect
is the same as `str.split()`, except this function is a generator.

    str_split('aaa\\t bb c \\n')
        -> 'aaa', 'bb', 'c'
import string

def _str_split_chars(s, delims):
    "Split the string `s` by characters contained in `delims`, including the \
empty parts between two consecutive delimiters"
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i+1
    yield s[start:]

def _str_split_chars_ne(s, delims):
    "Split the string `s` by longest possible sequences of characters \
contained in `delims`"
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]

def _str_split_word(s, delim):
    "Split the string `s` by the string `delim`"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    yield s[start:]

def _str_split_word_ne(s, delim):
    "Split the string `s` by the string `delim`, not including empty parts \
between two consecutive delimiters"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start!=i:
                yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    if start<len(s):
        yield s[start:]

def str_split(s, *delims, empty=None):
    """\
    Split the string `s` by the rest of the arguments, possibly omitting
    empty parts (`empty` keyword argument is responsible for that).
    This is a generator function.

    When only one delimiter is supplied, the string is simply split by it.
    `empty` is then `True` by default.
        str_split('[]aaa[][]bb[c', '[]')
            -> '', 'aaa', '', 'bb[c'
        str_split('[]aaa[][]bb[c', '[]', empty=False)
            -> 'aaa', 'bb[c'

    When multiple delimiters are supplied, the string is split by longest
    possible sequences of those delimiters by default, or, if `empty` is set to
    `True`, empty strings between the delimiters are also included. Note that
    the delimiters in this case may only be single characters.
        str_split('aaa, bb : c;', ' ', ',', ':', ';')
            -> 'aaa', 'bb', 'c'
        str_split('aaa, bb : c;', *' ,:;', empty=True)
            -> 'aaa', '', 'bb', '', '', 'c', ''

    When no delimiters are supplied, `string.whitespace` is used, so the effect
    is the same as `str.split()`, except this function is a generator.
        str_split('aaa\\t bb c \\n')
            -> 'aaa', 'bb', 'c'
    """
    if len(delims)==1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
    if len(delims)==0:
        delims = string.whitespace
    delims = set(delims) if len(delims)>=4 else ''.join(delims)
    if any(len(d)>1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)
This function works in Python 3; an easy, though quite ugly, fix can be applied to make it work in both versions 2 and 3. The first lines of the function should be changed to:
def str_split(s, *delims, **kwargs):
    """...docstring..."""
    empty = kwargs.get('empty')
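Either way, a quick sanity check against the docstring examples:

>>> assert list(str_split('[]aaa[][]bb[c', '[]')) == ['', 'aaa', '', 'bb[c']
>>> assert list(str_split('aaa, bb : c;', ' ', ',', ':', ';')) == ['aaa', 'bb', 'c']
>>> assert list(str_split('aaa\t bb c \n')) == ['aaa', 'bb', 'c']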

I wrote a version of @ninjagecko's answer that behaves more like string.split (i.e. whitespace delimited by default and you can specify a delimiter).
import re

def isplit(string, delimiter=None):
    """Like string.split but returns an iterator (lazy)

    Multiple character delimiters are not handled.
    """
    if delimiter is None:
        # Whitespace delimited by default
        delim = r"\s"
    elif len(delimiter) != 1:
        raise ValueError("Can only handle single character delimiters",
                         delimiter)
    else:
        # Escape, in case it's "\", "*" etc.
        delim = re.escape(delimiter)
    return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))
Here are the tests I used (in both Python 3 and Python 2):

# Wrapper to make it a list
def helper(*args, **kwargs):
    return list(isplit(*args, **kwargs))

# Normal delimiters
assert helper("1,2,3", ",") == ["1", "2", "3"]
assert helper("1;2;3,", ";") == ["1", "2", "3,"]
assert helper("1;2 ;3, ", ";") == ["1", "2 ", "3, "]

# Whitespace
assert helper("1 2 3") == ["1", "2", "3"]
assert helper("1\t2\t3") == ["1", "2", "3"]
assert helper("1\t2 \t3") == ["1", "2", "3"]
assert helper("1\n2\n3") == ["1", "2", "3"]

# Surrounding whitespace dropped
assert helper(" 1 2 3 ") == ["1", "2", "3"]

# Regex special characters
assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
assert helper(r"1*2*3", "*") == ["1", "2", "3"]

# No multi-char delimiters allowed
try:
    helper(r"1,.2,.3", ",.")
    assert False
except ValueError:
    pass
Python's regex module says that it does "the right thing" for unicode whitespace, but I haven't actually tested it.
Also available as a gist.

If you would also like to be able to read an iterator (as well as return one), try this:

import itertools as it

def iter_split(string, sep=None):
    sep = sep or ' '
    groups = it.groupby(string, lambda s: s != sep)
    return (''.join(g) for k, g in groups if k)
Usage
>>> list(iter_split(iter("Good evening, world!")))
['Good', 'evening,', 'world!']
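Because `groupby` consumes any iterable of characters, the input does not have to be a fully materialized string. For example (a sketch of mine with simulated chunks):

import itertools as it

# chunks as they might arrive from a file or socket (simulated here)
chunks = iter(["Good eve", "ning, wo", "rld!"])
stream = it.chain.from_iterable(chunks)
print(list(iter_split(stream)))   # ['Good', 'evening,', 'world!']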

No, but it should be easy enough to write one using `itertools.takewhile()`.
EDIT:
Very simple, half-broken implementation:
import itertools
import string

def isplitwords(s):
    i = iter(s)
    while True:
        r = []
        for c in itertools.takewhile(lambda x: x not in string.whitespace, i):
            r.append(c)
        if r:
            yield ''.join(r)
        else:
            # raising StopIteration inside a generator is a RuntimeError
            # since Python 3.7 (PEP 479); a plain return ends the generator
            return
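A quick check (my addition) shows both where it works and where the "half-broken" part bites: consecutive whitespace still ends the generator early.

>>> list(isplitwords("Good evening, world!"))
['Good', 'evening,', 'world!']
>>> list(isplitwords("a  b"))  # stops at the double space
['a']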

- @Ignacio: The example in docs uses a list of integers to illustrate the use of `takeWhile`. What would be a good `predicate` for splitting a string into words (default `split`) using `takeWhile()`? – Manoj Govindan Oct 05 '10 at 08:36
- The separator can have multiple characters, `'abc ghi<><>lmn'.split('<>') == ['abc`… – kennytm Oct 05 '10 at 08:42
- Easy to write, but *many* orders of magnitude slower. This is an operation that really should be implemented in native code. – Glenn Maynard Oct 05 '10 at 08:43
- @KennyTM: Sure, it *can* be. But it doesn't always need to be, and it usually is not. – Ignacio Vazquez-Abrams Oct 05 '10 at 08:44
- @Glenn: Is the string type's `split` implemented in native code? I checked `string.split` and found it dispatches to `s.split` where `s` is the first argument to `string.split`. – Manoj Govindan Oct 05 '10 at 09:09
- @Manoj: `str` and `unicode` are implemented in native code, so yes. – Ignacio Vazquez-Abrams Oct 05 '10 at 09:13
- @Ignacio: Got it. Is a native generator version possible at all? – Manoj Govindan Oct 05 '10 at 09:27
- Probably. You may need to implement a new type for the generator and fill its `tp_iternext` member, but I don't know all the details. – Ignacio Vazquez-Abrams Oct 05 '10 at 09:33
- It's a lot more work, and I doubt the value of this to begin with, but anything you can do in Python you can do natively if you really want to. – Glenn Maynard Oct 05 '10 at 09:52
I don't see any obvious benefit to a generator version of `split()`. The generator object is going to have to contain the whole string to iterate over, so you're not going to save any memory by having a generator.

If you wanted to write one it would be fairly easy though:
import string

def gsplit(s, sep=string.whitespace):
    word = []
    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)
    if word:
        yield "".join(word)

- You'd halve the memory used by not having to store a second copy of the string in each resulting part, plus the array and object overhead (which is typically more than the strings themselves). That generally doesn't matter, though (if you're splitting strings so large that this matters, you're probably doing something wrong), and even a native C generator implementation would always be significantly slower than doing it all at once. – Glenn Maynard Oct 05 '10 at 08:58
- @Glenn Maynard - I just realised that. For some reason I originally thought the generator would store a copy of the string rather than a reference. A quick check with `id()` put me right. And obviously as strings are immutable you don't need to worry about someone changing the original string while you're iterating over it. – David Webb Oct 05 '10 at 09:02
- Isn't the main point in using a generator not the memory usage, but that you could save yourself having to split the whole string if you wanted to exit early? (That's not a comment on your particular solution, I was just surprised by the discussion about memory). – Scott Griffiths Oct 05 '10 at 16:15
- @Scott: It's hard to think of a case where that's really a win--where 1: you want to stop splitting partway through, 2: you don't know how many words you're splitting in advance, 3: you have a large enough string for it to matter, and 4: you consistently stop early enough for it to be a significant win over str.split. That's a very narrow set of conditions. – Glenn Maynard Oct 05 '10 at 20:35
- You can have a much higher benefit if your string is lazily generated as well (e.g. from network traffic or file reads) – Lie Ryan Feb 22 '11 at 10:53
`more_itertools.split_at` offers an analog to `str.split` for iterators.

>>> import more_itertools as mit
>>> list(mit.split_at("abcdcba", lambda x: x == "b"))
[['a'], ['c', 'd', 'c'], ['a']]
>>> "abcdcba".split("b")
['a', 'cdc', 'a']

`more_itertools` is a third-party package.
- Note that `more_itertools.split_at()` is still using a newly allocated list on each call, so while this does return an iterator, it is not achieving the constant memory requirement. So depending on why you wanted an iterator to begin with, this may or may not be helpful. – jcater Aug 06 '19 at 13:13
- @jcater Good point. The intermediate values are indeed buffered as sub-lists within the iterator, according to its [implementation](https://more-itertools.readthedocs.io/en/latest/_modules/more_itertools/more.html#split_at). One could adapt the source to substitute lists with iterators, append with `itertools.chain` and evaluate results using a list comprehension. Depending on the need and request, I can post an example. – pylang Aug 06 '19 at 17:06
I wanted to show how to use the `finditer` solution to return a generator for given delimiters, and then use the `pairwise` recipe from itertools to build a previous/next iteration which will get the actual words as in the original split method.
import re
from more_itertools import pairwise

string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
delimiter = " "

# split according to the given delimiter, including segments that begin at the
# start of the string and end at its end
for prev, curr in pairwise(re.finditer("^|[{0}]+|$".format(delimiter), string)):
    print(string[prev.end(): curr.start()])
Notes:

- I use prev & curr instead of prev & next because overriding `next` in Python is a very bad idea
- This is quite efficient

- 934
- 2
- 10
- 31
Dumbest method, without regex / itertools:

def isplit(text, split='\n'):
    while text != '':
        end = text.find(split)
        if end == -1:
            yield text
            text = ''
        else:
            yield text[:end]
            text = text[end + len(split):]  # skip the whole separator, not just one char
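A couple of quick checks (my addition); since the step above advances by `len(split)`, multi-character separators work too:

>>> list(isplit("a\nb\nc"))
['a', 'b', 'c']
>>> list(isplit("ab<>cd<><>ef", "<>"))
['ab', 'cd', '', 'ef']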

Very old question, but here is my humble contribution with an efficient algorithm:
from typing import Iterable

def str_split(text: str, separator: str) -> Iterable[str]:
    i = 0
    n = len(text)
    while i <= n:
        j = text.find(separator, i)
        if j == -1:
            j = n
        yield text[i:j]
        i = j + len(separator)  # step past the full separator (j + 1 only works for single characters)
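A quick comparison (my addition) with `str.split` given an explicit separator:

>>> list(str_split("a,b,,c,", ","))
['a', 'b', '', 'c', '']
>>> "a,b,,c,".split(",")
['a', 'b', '', 'c', '']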

def split_generator(f, s):
    """
    f is a string, s is the (single-character) separator we split on.
    This produces a generator rather than a possibly
    memory intensive list.
    """
    i = 0
    j = 0
    while j < len(f):
        if i >= len(f):
            yield f[j:]
            j = i
        elif f[i] != s:
            i = i + 1
        else:
            yield f[j:i]  # yield the fragment itself, not a one-element list
            j = i + 1
            i = i + 1

Here is a simple response:
def gen_str(some_string, sep):
    j = 0
    guard = len(some_string) - 1
    for i, s in enumerate(some_string):
        if s == sep:
            yield some_string[j:i]
            j = i + 1
        elif i != guard:
            continue
        else:
            yield some_string[j:]

import re

def isplit(text, sep=None, maxsplit=-1):
    if not isinstance(text, (str, bytes)):
        raise TypeError(f"requires 'str' or 'bytes' but received a '{type(text).__name__}'")
    if sep in ('', b''):
        raise ValueError('empty separator')
    if maxsplit == 0 or not text:
        yield text
        return
    regex = (
        re.escape(sep) if sep is not None
        else [br'\s+', r'\s+'][isinstance(text, str)]
    )
    # note: re.split materializes the full list internally,
    # so only the iteration over the result is lazy here
    yield from re.split(regex, text, maxsplit=max(0, maxsplit))

Here is an answer that is based on split and maxsplit. This does not use recursion.
def gsplit(todo):
    chunk = 100
    while todo:
        splits = todo.split(maxsplit=chunk)
        if len(splits) == chunk + 1:  # a still-unsplit remainder is present
            todo = splits.pop()       # re-split it on the next pass
        else:
            todo = None
        for item in splits:
            yield item
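A quick test (my addition) on input longer than one chunk:

>>> text = " ".join(str(i) for i in range(1000))
>>> list(gsplit(text)) == text.split()
True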
