
I have a multi-line string defined like this:

foo = """
this is 
a multi-line string.
"""

I use this string as test input for a parser I am writing. The parser function receives a file object as input and iterates over it. It also calls the next() method directly to skip lines, so I really need an iterator as input, not just an iterable. I need an iterator that iterates over the individual lines of that string the way a file object would iterate over the lines of a text file. I could of course do it like this:

lineiterator = iter(foo.splitlines())

Is there a more direct way of doing this? In this scenario the string has to be traversed once for the splitting, and then again by the parser. It doesn't matter in my test case, since the string is very short there; I am just asking out of curiosity. Python has so many useful and efficient built-ins for this kind of thing, but I could find nothing that suits this need.
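To make it concrete, here is a made-up sketch of the kind of parser I mean (the '#' rule is invented just for illustration); note the direct next() call, which is why a plain iterable won't do:

```python
# Made-up sketch: the parser both loops over the lines and calls
# next() on the same iterator to skip a line.
def parse(lines):
    results = []
    for line in lines:
        if line.startswith('#'):
            # hypothetical rule: a '#' line makes the parser skip the next line
            next(lines, None)
            continue
        results.append(line)
    return results

print(parse(iter("keep\n#directive\nskipped\nalso kept".splitlines())))
# ['keep', 'also kept']
```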

Andriy
Björn Pollex
    you're aware that you can iterate over `foo.splitlines()` right? – SilentGhost Jun 16 '10 at 15:17
  • What do you mean by "again by the parser"? – danben Jun 16 '10 at 15:19
    @SilentGhost: I think the point is to not iterate the string twice. Once it is iterated by `splitlines()` and a second time by iterating over the result of this method. – Felix Kling Jun 16 '10 at 15:21
    Is there a particular reason why splitlines() does not return an iterator by default? I thought the trend was to generally do that for iterables. Or is this only true for specific functions like dict.keys()? – Cerno Jan 31 '20 at 11:35

6 Answers


Here are three possibilities:

foo = """
this is 
a multi-line string.
"""

def f1(foo=foo): return iter(foo.splitlines())

def f2(foo=foo):
    retval = ''
    for char in foo:
        retval += char if not char == '\n' else ''
        if char == '\n':
            yield retval
            retval = ''
    if retval:
        yield retval

def f3(foo=foo):
    prevnl = -1
    while True:
        nextnl = foo.find('\n', prevnl + 1)
        if nextnl < 0: break
        yield foo[prevnl + 1:nextnl]
        prevnl = nextnl

if __name__ == '__main__':
    for f in f1, f2, f3:
        print list(f())

Running this as the main script confirms the three functions are equivalent. With timeit (and `foo * 100` to get a substantial string for more precise measurement):

$ python -mtimeit -s'import asp' 'list(asp.f1())'
10000 loops, best of 3: 61.5 usec per loop
$ python -mtimeit -s'import asp' 'list(asp.f2())'
1000 loops, best of 3: 1.36 msec per loop
$ python -mtimeit -s'import asp' 'list(asp.f3())'
1000 loops, best of 3: 370 usec per loop

Note we need the list() call to ensure the iterators are traversed, not just built.

IOW, the naive `splitlines`-based `f1` is so much faster it isn't even funny: about 6 times faster than my `f3` with its `find` calls, which in turn is about 4 times faster than the lower-level `f2`.

Lessons to retain: measurement is always a good thing (but must be accurate); string methods like splitlines are implemented in very fast ways; putting strings together by programming at a very low level (esp. by loops of += of very small pieces) can be quite slow.

Edit: added @Jacob's proposal, slightly modified to give the same results as the others (trailing blanks on a line are kept), i.e.:

from cStringIO import StringIO

def f4(foo=foo):
    stri = StringIO(foo)
    while True:
        nl = stri.readline()
        if nl != '':
            yield nl.strip('\n')
        else:
            raise StopIteration

Measuring gives:

$ python -mtimeit -s'import asp' 'list(asp.f4())'
1000 loops, best of 3: 406 usec per loop

not quite as good as the `.find`-based approach -- still worth keeping in mind, though, because it might be less prone to small off-by-one bugs. Any loop where you see occurrences of +1 and -1, like my f3 above, should automatically trigger off-by-one suspicions -- as should many loops that lack such tweaks but should have them. That said, I believe my code is also right, since I was able to check its output against that of the other functions.

But the split-based approach still rules.

An aside: possibly better style for f4 would be:

from cStringIO import StringIO

def f4(foo=foo):
    stri = StringIO(foo)
    while True:
        nl = stri.readline()
        if nl == '': break
        yield nl.strip('\n')

at least, it's a bit less verbose. The need to strip trailing `\n`s unfortunately prohibits the clearer and faster replacement of the while loop with `return iter(stri)` (the `iter` part of which is redundant in modern versions of Python, I believe since 2.3 or 2.4, but it's also innocuous). Maybe also worth trying:

    return itertools.imap(lambda s: s.strip('\n'), stri)

or variations thereof -- but I'm stopping here since it's pretty much a theoretical exercise wrt the strip based, simplest and fastest, one.
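For reference, here is my Python 3 adaptation of the same idea (the names are mine; `cStringIO` is gone there, `io.StringIO` iterates by line, and the built-in `map` replaces `itertools.imap`):

```python
import io

def f4_py3(foo):
    # io.StringIO gives a file-like line iterator over the string;
    # map lazily strips the trailing newline from each line
    return map(lambda s: s.strip('\n'), io.StringIO(foo))

print(list(f4_py3("this is \na multi-line string.\n")))
# ['this is ', 'a multi-line string.']
```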

Alex Martelli
  • Also, `(line[:-1] for line in cStringIO.StringIO(foo))` is pretty fast; almost as fast as the naive implementation, but not quite. – Matt Anderson Jun 16 '10 at 15:56
    Thank you for this great answer. I guess the main lesson here (as I am new to python) is to make using `timeit` a habit. – Björn Pollex Jun 16 '10 at 19:20
  • @Space, yep, timeit is good, any time you care about performance (be sure to use it carefully, e.g. in this case see my note about needing a `list` call to actually time all the relevant parts!-). – Alex Martelli Jun 16 '10 at 21:20
  • Can you try to measure with some larger examples as well? Like a million lines or something. – Thomas Ahle Mar 09 '14 at 18:13
    What about memory consumption? `split()` clearly trades memory for performance, holding a copy of all sections in addition to the list's structures. – ivan_pozdeev Apr 13 '14 at 22:18
  • There's also `re.finditer()` that might be quite fast. – rjh Jun 20 '15 at 12:01
  • I don't understand what we win with the call to `iter` in `f1`. I see no difference in behavior with or without with Python 2 and 3, no differences in perfs with Python 3, but without is about twice faster than with, with Python 2 (i.e., twice faster than Python 3 actually). – akim Aug 26 '15 at 11:59
  • akim, try calling `next(f1())` with and without the `iter` call in `f1`, and tell us again that "you see no difference in behavior" between seeing the first line (as happens *with* that `iter`) and getting a `TypeError` exception (as happens *without* that `iter`)...!-) – Alex Martelli Aug 28 '15 at 21:48
    It's important to note that `splitlines()` does not only split on `\n` but on a whole host of things that are considered newline characters: https://docs.python.org/3.5/library/stdtypes.html#str.splitlines – Matt Boehm May 05 '16 at 14:10
    I was really confused by your remarks at first because you listed the timing results in the opposite order of their implementation and numbering. =P – jamesdlin Oct 04 '17 at 22:59
  • Agreed @jamesdlin, this post made no sense to me until I saw your comment. The assumptions we can make as readers, I suppose. Alex, if you ever do read this, please do re-order those timings! – marshall.ward May 26 '18 at 11:11
  • Late to here, but the primary advantage of iterator solutions is not reduced memory but the possibility of early exit. For example, if you are searching for a particular line (near the top) the iterator method will allow you to bail out (avoiding scanning the whole of the rest of the string) early when you find it. The list producing `str.splitlines()` will continue until the end. The timings are slightly misleading in that they force the (true) iterators to process everything, which presupposes the use case. There's a lot to be said for builtins, though :). – Steve Powell Nov 15 '19 at 16:32
  • `f3` might be slower than it needs to be. Because unicode strings can have variable-length characters, `str.find` likely can't use its `start` parameter to O(1) index into the string and must iterate to there instead. This means `f3` has lots of unnecessary traversal of the string compared to, say, storing the index of the last `\n` found and iterating from there yourself. Of course, the loops that `find` uses are written in C, while you'd be using a Python loop, so it's hard to know which would be faster a priori. But worth considering. – BallpointBen Feb 09 '21 at 21:29

I'm not sure what you mean by "then again by the parser". After the splitting has been done, there's no further traversal of the string, only a traversal of the list of split strings. This will probably actually be the fastest way to accomplish this, so long as the size of your string isn't absolutely huge. The fact that python uses immutable strings means that you must always create a new string, so this has to be done at some point anyway.

If your string is very large, the disadvantage is in memory usage: you'll have the original string and a list of split strings in memory at the same time, doubling the memory required. An iterator approach can save you this, building a string as needed, though it still pays the "splitting" penalty. However, if your string is that large, you generally want to avoid even the unsplit string being in memory. It would be better just to read the string from a file, which already allows you to iterate through it as lines.

However if you do have a huge string in memory already, one approach would be to use StringIO, which presents a file-like interface to a string, including allowing iterating by line (internally using .find to find the next newline). You then get:

import StringIO
s = StringIO.StringIO(myString)
for line in s:
    do_something_with(line)
Brian
    Note: for python 3 you have to use the `io` package for this, e.g. use `io.StringIO` instead of `StringIO.StringIO`. See https://docs.python.org/3/library/io.html – Attila123 Dec 13 '18 at 11:09
    Using `StringIO` is also a good way to get high-performance universal newline handling. – martineau Nov 21 '19 at 15:40

You can iterate over "a file", which produces lines, including the trailing newline character. To make a "virtual file" out of a string, you can use StringIO:

import io  # for Py2.7 that would be import cStringIO as io

for line in io.StringIO(foo):
    print(repr(line))
Tomasz Gandor

If I read Modules/cStringIO.c correctly, this should be quite efficient (although somewhat verbose):

from cStringIO import StringIO

def iterbuf(buf):
    stri = StringIO(buf)
    while True:
        nl = stri.readline()
        if nl != '':
            yield nl.strip()
        else:
            raise StopIteration
Jacob Oscarson

Regex-based searching is sometimes faster than a generator approach:

import re

RRR = re.compile(r'(.*)\n')
def f4(arg):
    return (i.group(1) for i in RRR.finditer(arg))
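One caveat worth noting (an observation, not part of the original answer): `(.*)\n` requires a trailing newline, so a final unterminated line is silently dropped:

```python
import re

RRR = re.compile(r'(.*)\n')

def f4(arg):
    # yields one captured group per newline-terminated line
    return (i.group(1) for i in RRR.finditer(arg))

print(list(f4("a\nb\n")))  # ['a', 'b']
print(list(f4("a\nb")))    # ['a'] -- the final 'b' has no '\n', so it is lost
```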
socketpair
    This question is about a specific scenario, so it would helpful to show a simple benchmark, like the top-scoring answer has done. – Björn Pollex Jun 26 '17 at 13:42

I suppose you could roll your own:

def parse(string):
    retval = ''
    for char in string:
        retval += char if not char == '\n' else ''
        if char == '\n':
            yield retval
            retval = ''
    if retval:
        yield retval

I'm not sure how efficient this implementation is, but it will only iterate over your string once.

Mmm, generators.

Edit:

Of course you'll also want to add in whatever type of parsing actions you want to take, but that's pretty simple.
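If the per-character `+=` worries you on long lines (see the comments below about its worst-case cost), an untimed sketch of the usual alternative collects characters in a list and joins once per line:

```python
def parse_join(string):
    # collect characters in a list and join once per line,
    # avoiding repeated string concatenation
    chars = []
    for char in string:
        if char == '\n':
            yield ''.join(chars)
            chars = []
        else:
            chars.append(char)
    if chars:
        yield ''.join(chars)

print(list(parse_join("a\nbb\nc")))  # ['a', 'bb', 'c']
```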

Wayne Werner
  • Pretty inefficient for long lines (the `+=` part has worst-case `O(N squared)` performance, though several implementation tricks try to lower that when feasible). – Alex Martelli Jun 16 '10 at 15:25
  • Yeah - I've just been learning about that recently. Would it be faster to append to a list of chars and then ''.join(chars) them? Or is that an experiment I should undertake myself? ;) – Wayne Werner Jun 16 '10 at 15:40
  • please do measure yourself, it's instructive -- and be sure to try both short lines like in the OP's example, and long ones!-) – Alex Martelli Jun 16 '10 at 15:52
  • For short strings ( < ~40 chars) the += is actually quicker, but hits worst case quickly. For longer strings, the `.join` method actually looks like O(N) complexity. Since I couldn't find the particular comparison made on SO yet, I started a question http://stackoverflow.com/questions/3055477/how-slow-is-pythons-string-concatenation-vs-str-join (that surprisingly received more answers than just my own!) – Wayne Werner Jun 16 '10 at 17:26