32

Does Python have a built-in (meaning in the standard library) way to split strings that produces an iterator rather than a list? I have in mind working on very long strings and not needing to consume most of the string.

pythonic metaphor
    "not needed to consume most of the string"? What does this mean? The string object is all in memory, right? Since it's all in memory, and it's already a sequence, there's nothing required to iterate over the characters. Can you please define what you mean by "not needed to consume most of the string"? – S.Lott Jan 03 '11 at 16:09
  • Yes, the string is already in memory. But I don't need to traverse the whole string to figure out where to split or to create the substrings resulting from the split. – pythonic metaphor Jan 03 '11 at 16:11
  • 1
    Perhaps you need a tokeniser or scanner of some sort which provides an iterator. The answer below with the regular expression solution could work. – Chris Dennett Jan 03 '11 at 16:12
  • 7
    I think what @pythonic wants is an equivalent of `str.split()` that returns an iterator rather than a list. – moinudin Jan 03 '11 at 16:13
  • 1
    @marcog That's just what I want. I can certainly write one myself, but this seemed like the sort of thing that would be sitting in a Python library. – pythonic metaphor Jan 03 '11 at 16:16
  • "I don't need to traverse the whole string". "an equivalent of str.split() that returns an iterator". What? The "str.split() that returns an iterator" will traverse the whole string. I still am totally baffled by the various comments on the question. Can you provide a fake code sample that shows how you'd use this magical thing which doesn't traverse the whole string, yet does a split (which will traverse the whole string)? – S.Lott Jan 03 '11 at 17:12
  • @S.Lott: I guess he has some long string with a million spaces but wants to parse just one word at a time and then decide whether to move on to the next or not. Maybe something like parsing file headers or a lexer. – Jochen Ritzel Jan 03 '11 at 18:40
  • @THC4k: That's possible. But it doesn't square not "traverse" (or not "consume") the whole string. Parsing just one word at a time still traverses the whole string. – S.Lott Jan 03 '11 at 19:40
  • 4
    @S.Lott You seem to be really confused here, but I will break it down for you. When you do `somestring.split(" ")`, for example, a whole list is allocated, `O(n)` space, whereas an iterable splitter takes only as much space as the largest splitable substring. Additionally, traversing the entire string is `O(n)` time, but if a condition is reached early which renders the rest of the computation unnecessary, this time saving can only be achieved with an iterator. – ealfonso Nov 21 '13 at 04:33
  • Possible duplicate of [Is there a generator version of \`string.split()\` in Python?](http://stackoverflow.com/questions/3862010/is-there-a-generator-version-of-string-split-in-python) – Chris_Rands Feb 28 '17 at 16:14

7 Answers

22

Not for splitting strings as such, but the `re` module has `re.finditer()` (and the corresponding `finditer()` method on any compiled regular expression).

@Zero asked for an example:

>>> import re
>>> s = "The quick    brown\nfox"
>>> for m in re.finditer(r'\S+', s):
...     print(m.span(), m.group(0))
... 
(0, 3) The
(4, 9) quick
(13, 18) brown
(19, 22) fox
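The laziness of `finditer` is what makes this fit the question: if you only need the first few tokens, you never scan the rest. A minimal sketch (the long string here is made up for illustration), using `itertools.islice` to stop early:

```python
import re
from itertools import islice

s = "alpha beta gamma " + "x" * 1_000_000  # long tail we never need to scan

# re.finditer is lazy, so islice stops the scan after the second match.
first_two = [m.group(0) for m in islice(re.finditer(r'\S+', s), 2)]
print(first_two)  # ['alpha', 'beta']
```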
Duncan
6

Like S.Lott, I don't quite know what you want. Here is code that may help:

s = "This is a string."
for character in s:
    print(character)
for word in s.split(' '):
    print(word)

There are also `s.index()` and `s.find()` for finding the next occurrence of a character.


Later: Okay, something like this.

>>> def tokenizer(s, c):
...     i = 0
...     while True:
...         try:
...             j = s.index(c, i)
...         except ValueError:
...             yield s[i:]
...             return
...         yield s[i:j]
...         i = j + 1
... 
>>> for w in tokenizer(s, ' '):
...     print(w)
... 
This
is
a
string.
hughdbrown
3

If you don't need to consume the whole string, that's because you are looking for something specific, right? Then just look for that, with re or .find() instead of splitting. That way you can find the part of the string you are interested in, and split that.
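A minimal sketch of that approach (the record and the marker string are made up for illustration): locate the part you care about with `.find()`, then split only a bounded slice instead of the whole string:

```python
s = "INFO 2011-01-03 login user=alice session=42 ok"  # hypothetical long record

# Find the region of interest first, then split only a bounded slice of it.
pos = s.find("login")
if pos != -1:
    parts = s[pos:pos + 40].split()
    print(parts[:2])  # ['login', 'user=alice']
```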

Lennart Regebro
  • In the application I had in mind, I wanted to split on white space, check the third substring, depending on what that was, check the fourth or sixth substring, and then possibly process the rest of the string. – pythonic metaphor Jan 03 '11 at 18:03
  • 2
    @pythonic metaphor: Yeah, if that string is *really* long you might want to use `re` or `find`. In the other case, just split it on whitespace. I don't know, but to me your question sounds like it may be premature optimization. ;) So you have to profile it to be sure. – Lennart Regebro Jan 03 '11 at 18:07
  • 4
    @pythonic metaphor: For normal text that is just premature optimization. Text starts being "large" somewhere >>10MB. For the application you described I'd just go with `text.split(None, 6)` to get the first 6 words. If you have to split the entire text anyways just do it right away. – Jochen Ritzel Jan 03 '11 at 18:57
  • @pythonic metaphor: If those are your requirements, then please **update** the question to actually identify what you're actually trying to do. – S.Lott Jan 03 '11 at 21:37
2

There is no built-in iterator-based analog of `str.split`. Depending on your needs, you could make a list iterator:

iterator = iter("abcdcba".split("b"))
iterator
# <list_iterator at 0x49159b0>
next(iterator)
# 'a'

However, the third-party library more-itertools likely offers what you want: `more_itertools.split_at`.
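For instance (assuming more-itertools is installed; `split_at` yields groups as lists of items, so join the characters back into strings):

```python
from more_itertools import split_at  # third-party: pip install more-itertools

# split_at is lazy: it yields one group at a time, split where the predicate is true.
pieces = (''.join(chunk) for chunk in split_at("abcdcba", lambda c: c == "b"))
print(list(pieces))  # ['a', 'cdc', 'a']
```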

pylang
1

Here's an `isplit` function, which behaves much like `split` - you can turn off the regex syntax with the `regex` argument. It uses the `re.finditer` function and yields the strings in between the matches.

import re

def isplit(s, splitter=r'\s+', regex=True):
    if not regex:
        splitter = re.escape(splitter)

    start = 0

    for m in re.finditer(splitter, s):
        begin, end = m.span()
        if begin != start:
            yield s[start:begin]
        start = end

    if s[start:]:
        yield s[start:]


_examples = ['', 'a', 'a b', ' a  b c ', '\na\tb ']

def test_isplit():
    for example in _examples:
        assert list(isplit(example)) == example.split(), 'Wrong for {!r}: {} != {}'.format(
            example, list(isplit(example)), example.split()
        )
Tomasz Gandor
  • Note that the `splitter` can be quite arbitrary, not only a single character like in many ideas, for example: https://gist.github.com/davidshepherd7/2857bfc620a648a90e7f - there is also some discussion about the sense of doing this - because the "string is already in RAM anyway". I think there are legitimate cases for an `isplit`. – Tomasz Gandor Nov 27 '19 at 13:31
0

Look at itertools. It contains things like takewhile, islice and groupby that allow you to slice an iterable -- a string is iterable -- into another iterable based on either indexes or a boolean condition of sorts.
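As a sketch of that idea, `itertools.groupby` can serve as a lazy whitespace split by grouping consecutive characters on whether they are whitespace (it walks the string character by character, so it trades speed for memory):

```python
from itertools import groupby

s = "The quick    brown\nfox"

# Keep only the runs of non-whitespace characters, joined back into words.
words = (''.join(group) for is_space, group in groupby(s, key=str.isspace) if not is_space)

print(next(words))   # 'The'
print(list(words))   # ['quick', 'brown', 'fox']
```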

izak
0

You could use something like SPARK (which has been absorbed into the Python distribution itself, though it is not importable from the standard library), but ultimately it uses regular expressions as well, so Duncan's answer would likely serve you just as well if your goal is simply "splitting on whitespace".

The other, far more arduous option would be to write your own Python module in C if you really wanted speed, but that is of course a far larger time investment.

Mu Mind
Daniel DiPaolo