Does Python have a built-in (meaning in the standard library) way to do a split on strings that produces an iterator rather than a list? I have in mind working on very long strings and not needing to consume most of the string.
-
"not needed to consume most of the string"? What does this mean? The string object is all in memory, right? Since it's all in memory, and it's already a sequence, there's nothing required to iterate over the characters. Can you please define what you mean by "not needed to consume most of the string"? – S.Lott Jan 03 '11 at 16:09
-
Yes, the string is already in memory. But I don't need to traverse the whole string to figure out where to split or to create the substrings resulting from the split. – pythonic metaphor Jan 03 '11 at 16:11
-
Perhaps you need a tokeniser or scanner of some sort which provides an iterator. The answer below with the regular expression solution could work. – Chris Dennett Jan 03 '11 at 16:12
-
I think what @pythonic wants is an equivalent of `str.split()` that returns an iterator rather than a list. – moinudin Jan 03 '11 at 16:13
-
@marcog That's just what I want. I can certainly write one myself, but this seemed like the sort of thing that would be sitting in a Python library. – pythonic metaphor Jan 03 '11 at 16:16
-
"I don't need to traverse the whole string". "an equivalent of str.split() that returns an iterator". What? The "str.split() that returns an iterator" will traverse the whole string. I still am totally baffled by the various comments on the question. Can you provide a fake code sample that shows how you'd use this magical thing which doesn't traverse the whole string, yet does a split (which will traverse the whole string)? – S.Lott Jan 03 '11 at 17:12
-
@S.Lott: I guess he has some long string with a million spaces but wants to parse just one word at a time and then decide whether to move on to the next or not. Maybe something like parsing file headers or a lexer. – Jochen Ritzel Jan 03 '11 at 18:40
-
@THC4k: That's possible. But it doesn't square not "traverse" (or not "consume") the whole string. Parsing just one word at a time still traverses the whole string. – S.Lott Jan 03 '11 at 19:40
-
@S.Lott You seem to be really confused here, but I will break it down for you. When you do `somestring.split(" ")`, for example, a whole list is allocated, `O(n)` space, whereas an iterable splitter takes only as much space as the largest splittable substring. Additionally, traversing the entire string is `O(n)` time, but if a condition is reached early which renders the rest of the computation unnecessary, this time saving can only be achieved with an iterator. – ealfonso Nov 21 '13 at 04:33
-
Possible duplicate of [Is there a generator version of \`string.split()\` in Python?](http://stackoverflow.com/questions/3862010/is-there-a-generator-version-of-string-split-in-python) – Chris_Rands Feb 28 '17 at 16:14
7 Answers
Not directly splitting strings as such, but the `re` module has `re.finditer()` (and a corresponding `finditer()` method on any compiled regular expression).
@Zero asked for an example:
>>> import re
>>> s = "The quick    brown\nfox"
>>> for m in re.finditer(r'\S+', s):
...     print(m.span(), m.group(0))
...
(0, 3) The
(4, 9) quick
(13, 18) brown
(19, 22) fox
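Because `finditer()` is lazy, it pairs naturally with `itertools.islice` when you only want the first few tokens of a very long string (the sample string below is illustrative, not from the original answer):

```python
import re
from itertools import islice

s = "The quick brown fox jumps over the lazy dog"
# finditer() scans lazily, so islice() stops after three matches;
# the remainder of the string is never examined
first_three = [m.group(0) for m in islice(re.finditer(r'\S+', s), 3)]
print(first_three)  # ['The', 'quick', 'brown']
```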

-
An example of how to use `re.finditer()` to iterate split strings would be helpful. – Zero Apr 01 '15 at 04:52
-
Like S.Lott, I don't quite know what you want. Here is code that may help:
s = "This is a string."
for character in s:
    print character

for word in s.split(' '):
    print word
There are also `s.index()` and `s.find()` for finding the next character.
Later: Okay, something like this.
>>> def tokenizer(s, c):
...     i = 0
...     while True:
...         try:
...             j = s.index(c, i)
...         except ValueError:
...             yield s[i:]
...             return
...         yield s[i:j]
...         i = j + 1
...
>>> for w in tokenizer(s, ' '):
...     print w
...
This
is
a
string.

-
See the clarification in the comments. This doesn't answer the question. – moinudin Jan 03 '11 at 16:19
-
-
@7vies: I thought this was better than saying "No" or saying "Use regular expressions (i.e. the answer above)." – hughdbrown Jan 03 '11 at 16:27
If you don't need to consume the whole string, that's because you are looking for something specific, right? Then just look for that, with `re` or `.find()` instead of splitting. That way you can find the part of the string you are interested in, and split that.
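For instance, grabbing just the first field with `.find()` touches only the characters up to the first separator, no matter how long the string is (the sample string here is hypothetical):

```python
s = "GET /very/long/path HTTP/1.1 more header text..."  # imagine a very long string
# scan only up to the first space; nothing past it is examined
end = s.find(' ')
method = s if end == -1 else s[:end]
print(method)  # GET
```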

-
In the application I had in mind, I wanted to split on white space, check the third substring, depending on what that was, check the fourth or sixth substring, and then possibly process the rest of the string. – pythonic metaphor Jan 03 '11 at 18:03
-
@pythonic metaphor: Yeah, if that string is *really* long you might want to use `re` or `find`. In the other case, just split it on whitespace. I don't know, but to me your question sounds like it may be premature optimization. ;) So you have to profile it to be sure. – Lennart Regebro Jan 03 '11 at 18:07
-
@pythonic metaphor: For normal text that is just premature optimization. Text starts being "large" somewhere >>10MB. For the application you described I'd just go with `text.split(None, 6)` to get the first 6 words. If you have to split the entire text anyways just do it right away. – Jochen Ritzel Jan 03 '11 at 18:57
-
@pythonic metaphor: If those are your requirements, then please **update** the question to actually identify what you're actually trying to do. – S.Lott Jan 03 '11 at 21:37
There is no built-in iterator-based analog of `str.split`. Depending on your needs you could make a list iterator:
iterator = iter("abcdcba".split("b"))
iterator
# <list_iterator at 0x49159b0>
next(iterator)
# 'a'
However, a tool from this third-party library likely offers what you want, `more_itertools.split_at`. See also this post for an example.

Here's an `isplit` function, which behaves much like `split` - you can turn off the regex syntax with the `regex` argument. It uses the `re.finditer` function and returns the strings in between the matches.
import re

def isplit(s, splitter=r'\s+', regex=True):
    if not regex:
        splitter = re.escape(splitter)
    start = 0
    for m in re.finditer(splitter, s):
        begin, end = m.span()
        if begin != start:
            yield s[start:begin]
        start = end
    if s[start:]:
        yield s[start:]

_examples = ['', 'a', 'a b', ' a b c ', '\na\tb ']

def test_isplit():
    for example in _examples:
        assert list(isplit(example)) == example.split(), 'Wrong for {!r}: {} != {}'.format(
            example, list(isplit(example)), example.split()
        )

-
Note that the `splitter` can be quite arbitrary, not only a single character like in many ideas, for example: https://gist.github.com/davidshepherd7/2857bfc620a648a90e7f - there is also some discussion about the sense of doing this - because the "string is already in RAM anyway". I think there are legitimate cases for an `isplit`. – Tomasz Gandor Nov 27 '19 at 13:31
Look at `itertools`. It contains things like `takewhile`, `islice` and `groupby` that allow you to slice an iterable -- a string is iterable -- into another iterable based on either indexes or a boolean condition of sorts.
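As a sketch of the `groupby` approach (this generator expression is an illustration, not from the original answer): grouping consecutive characters by `str.isspace` yields runs of characters lazily, so words come out one at a time:

```python
from itertools import groupby

s = "The quick brown fox"
# group consecutive characters by whether they are whitespace;
# join each non-whitespace run back into a word, lazily
words = (''.join(run) for is_space, run in groupby(s, key=str.isspace) if not is_space)
print(next(words))  # The
print(next(words))  # quick
```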

You could use something like SPARK (which has been absorbed into the Python distribution itself, though not importable from the standard library), but ultimately it uses regular expressions as well, so Duncan's answer would possibly serve you just as well if it were as easy as just "splitting on whitespace".

The other, far more arduous option would be to write your own Python module in C to do it if you really wanted speed, but that's a far larger time investment, of course.
