10

I have a text file like this:

11
2
3
4

11

111

Using Python 2.7, I want to turn it into a list of lists of lines, where line breaks divide items in the inner list and empty lines divide items in the outer list. Like so:

[["11","2","3","4"],["11"],["111"]]

And for this purpose, I wrote a generator function that would yield the inner lists one at a time once passed an open file object:

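def readParag(fileObj):
    currentParag = []
    for line in fileObj:
        stripped = line.rstrip()
        if len(stripped) > 0:
            currentParag.append(stripped)
        elif len(currentParag) > 0:
            yield currentParag
            currentParag = []
    # Yield the last paragraph if the file does not end with a blank line.
    if currentParag:
        yield currentParag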

That works fine, and I can call it from within a list comprehension, producing the desired result. However, it subsequently occurred to me that I might be able to do the same thing more concisely using itertools.takewhile (with a view to rewriting the generator function as a generator expression, but we'll leave that for now). This is what I tried:

from itertools import takewhile
def readParag(fileObj):
    yield [ln.rstrip() for ln in takewhile(lambda line: line != "\n", fileObj)]

In this case, the resulting generator yields only one result (the expected first one, i.e. ["11","2","3","4"]). I had hoped that calling its next method again would cause it to evaluate takewhile(lambda line: line != "\n", fileObj) again on the remainder of the file, thus leading it to yield another list. But no: I got a StopIteration instead. So I surmised that the takewhile expression was being evaluated once only, at the time when the generator object was created, and not each time I called the resultant generator object's next method.

This supposition made me wonder what would happen if I called the generator function again. The result was that it created a new generator object that also yielded a single result (the expected second one, i.e. ["11"]) before throwing a StopIteration back at me. So in fact, writing this as a generator function effectively gives the same result as if I'd written it as an ordinary function and returned the list instead of yielding it.

I guess I could solve this problem by creating my own class to use instead of a generator (as in John Millikin's answer to this question). But the point is that I was hoping to write something more concise than my original generator function (possibly even a generator expression). Can somebody tell me what I'm doing wrong, and how to get it right?

Westcroft_to_Apse

6 Answers

26

What you're trying to do is a perfect job for groupby:

from itertools import groupby

def read_parag(filename):
    with open(filename) as f:
        for k,g in groupby((line.strip() for line in f), bool):
            if k:
                yield list(g)

which will give:

>>> list(read_parag('myfile.txt'))
[['11', '2', '3', '4'], ['11'], ['111']]

Or in one line:

[list(g) for k,g in groupby((line.strip() for line in open('myfile.txt')), bool) if k]
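To see why the `if k` filter is needed, it may help to look at the raw (key, group) pairs that groupby produces. The following is a minimal sketch, not part of the original answer; `sample_lines` is a hypothetical in-memory stand-in for the lines of the file.

```python
from itertools import groupby

# Hypothetical stand-in for the lines read from the file.
sample_lines = ["11\n", "2\n", "3\n", "4\n", "\n", "11\n", "\n", "111\n"]

stripped = (line.strip() for line in sample_lines)

# bool("") is False and bool of any non-empty string is True, so each run of
# non-blank lines shares the key True and each run of blank lines shares False.
pairs = [(key, list(group)) for key, group in groupby(stripped, bool)]

# Keeping only the True-keyed groups drops the blank-line separators.
paragraphs = [group for key, group in pairs if key]
```

Here `pairs` alternates between True-keyed paragraphs and False-keyed runs of empty strings, and `paragraphs` ends up as the list of lists from the question.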
Rik Poggi
  • Use `bool` instead of the `lambda`, and `yield` the results instead of appending them to a list -- otherwise nice! =) – Katriel Aug 07 '12 at 19:26
  • +1 out of jealousy. I wrote a version of this as a genexp but didn't think of passing groupby the stripped lines, so I had `.strip()` in two places and I didn't like the look of it. You win this round! – DSM Aug 07 '12 at 19:31
  • @DSM, would you mind posting the generator expression you came up with for comparison? – Westcroft_to_Apse Aug 07 '12 at 19:34
  • @RikPoggi: I think that `g` would do for `list(g)` and that there's a missing closing parenthesis after the `open` function call in your one-liner (i.e.: `[g for k,g in groupby((line.strip() for line in open("myfile.txt")), bool) if k]`). Otherwise, you've answered my question! Thanks! – Westcroft_to_Apse Aug 07 '12 at 20:43
  • @Westcroft_to_Apse: your sample output was showing a list of lists. If your actual case is different (example: you may just need a one time consuming iterator) change what you need to. The missing `)` was a typo, fixed it. – Rik Poggi Aug 07 '12 at 23:31
  • @RikPoggi: you're right of course about that. From the point of view of my actual case, an iterator was better than lists - I'd forgotten that in my example I'd specifically mentioned lists. – Westcroft_to_Apse Aug 08 '12 at 07:44
7

The other answers do a good job of explaining what is going on here: you need to call takewhile multiple times, which your current generator does not do. Here is a fairly concise way to get the behavior you want using the built-in iter() function with a sentinel argument:

from itertools import takewhile

def readParag(fileObj):
    cond = lambda line: line != "\n"
    return iter(lambda: [ln.rstrip() for ln in takewhile(cond, fileObj)], [])
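As a rough usage sketch (not from the original answer), the two-argument form of iter() keeps calling the lambda until it returns the sentinel `[]`. The name `read_parag_sentinel` and the in-memory `io.StringIO` inputs are assumptions made for illustration. One consequence worth noting is that two consecutive blank lines also produce an empty list, so iteration stops early in that case.

```python
import io
from itertools import takewhile

def read_parag_sentinel(file_obj):
    # Hypothetical name; same idea as readParag above.
    cond = lambda line: line != "\n"
    # iter(callable, sentinel) calls the lambda repeatedly until it returns [].
    return iter(lambda: [ln.rstrip() for ln in takewhile(cond, file_obj)], [])

# Paragraphs separated by single blank lines: every paragraph is returned.
normal = io.StringIO(u"11\n2\n3\n4\n\n11\n\n111\n")
result_normal = list(read_parag_sentinel(normal))

# Two blank lines in a row: the second takewhile call returns [] at once,
# which equals the sentinel, so the paragraph after the gap is never reached.
double_blank = io.StringIO(u"11\n\n\n111\n")
result_double = list(read_parag_sentinel(double_blank))
```

If the input may contain runs of blank lines, the groupby approach is likely the safer choice.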
Andrew Clark
  • Thanks very much! The answer I've chosen is slightly more concise (using `groupby` instead of `takewhile`), but I'm grateful for your demonstration that the best way to get `takewhile` to work in this context is to use the `iter` function rather than a generator. – Westcroft_to_Apse Aug 07 '12 at 20:58
6

This is exactly how .takewhile() should behave. While the condition is true, it returns elements from the underlying iterable, and as soon as the condition is false, it permanently switches to the iteration-done stage.

Note that this is how iterators must behave; raising StopIteration means just that, stop iterating over me, I am done.

From the Python glossary entry for "iterator":

An object representing a stream of data. Repeated calls to the iterator’s next() method return successive items in the stream. When no more data are available a StopIteration exception is raised instead. At this point, the iterator object is exhausted and any further calls to its next() method just raise StopIteration again.
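The behavior described above can be checked with a small sketch. The list of lines is a hypothetical stand-in for an open file. Once the predicate fails, the takewhile object stays exhausted, but the underlying iterator still holds the remaining lines, so a fresh takewhile over it can pick up where the first one stopped.

```python
from itertools import takewhile

# Hypothetical stand-in for an open file: an iterator over lines.
lines = iter(["11\n", "2\n", "\n", "11\n"])
not_blank = lambda line: line != "\n"

first = takewhile(not_blank, lines)
first_items = list(first)   # everything before the first blank line
again = list(first)         # the same takewhile object is now exhausted

# The blank line that failed the predicate was consumed from `lines`,
# so a new takewhile starts at the line after it.
second_items = list(takewhile(not_blank, lines))
```

This is why the question's generator needs a loop (or some other repeated call) around takewhile rather than a single call.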

You could combine takewhile with tee to see if there are any more results in the next batch:

import itertools

def readParag(filename):
    with open(filename) as f:
        while True:
            paras = itertools.takewhile(lambda l: l.strip(), f)
            test, paras = itertools.tee(paras)
            test.next()  # raises StopIteration when the file is done
            yield (l.strip() for l in paras)

This yields generators, so each item yielded is itself a generator. You do need to consume all elements in these generators for this to continue to work; the same is true for the groupby method listed in another answer.
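The point about consuming each group before moving on can be illustrated with groupby. This is a minimal sketch with a hypothetical `sample_lines` list. Because all groups share one underlying iterator, a group that is stored and read only after groupby has advanced no longer has access to its items.

```python
from itertools import groupby

# Hypothetical stand-in for the stripped lines of the file.
sample_lines = ["11", "2", "", "11", "", "111"]

# Each group is turned into a list before groupby advances: contents are kept.
eager = [list(g) for k, g in groupby(sample_lines, bool) if k]

# The group iterators are stored and only read after groupby has moved on.
stored_groups = [g for k, g in groupby(sample_lines, bool) if k]
late = [list(g) for g in stored_groups]

first_late_group = late[0]  # comes back empty, its items are no longer visible
```

Yielding the bare group is fine when the caller processes each paragraph fully inside the loop; materializing with list() is the safer choice when the groups are collected first.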

Martijn Pieters
  • This. On that note, a possible fix would be to call `takewhile()` again each time a newline is detected. – jathanism Aug 07 '12 at 19:25
  • Thanks, Martijn - that's helpful. Do you know if there is an equivalent of `.takewhile` that *doesn't* permanently switch to the 'iteration-done' stage, so that I can make my one-liner work as I want it to? Or should I just stick with my original generator function and be thankful that it gets the job done? – Westcroft_to_Apse Aug 07 '12 at 19:27
  • Use `groupby()`, as in Rik Poggi's answer. – JAB Aug 07 '12 at 19:28
  • @JAB: I used a different method, actually. – Martijn Pieters Aug 07 '12 at 19:38
  • @JAB Rik Poggi's answer is great, but unless I'm missing something (entirely possible!) it looks too complicated to re-write as a generator expression, so I'm still wondering whether something like my one-liner can be made to work? – Westcroft_to_Apse Aug 07 '12 at 19:42
  • @Westcroft_to_Apse: simply replace `yield list(g)` with `yield g`. :-) – Martijn Pieters Aug 07 '12 at 19:44
  • @Westcroft_to_Apse: or if you'd like here's in one line: `[list(g) for k,g in groupby((line.strip() for line in open('myfile.txt')), bool) if k]` – Rik Poggi Aug 07 '12 at 19:46
  • @RikPoggi: Thanks - that's really helpful and I'm going to check out `groupby` for future work. Would you mind pasting that into your original answer so it can be found more easily? – Westcroft_to_Apse Aug 07 '12 at 20:00
  • Your one-liner is a list comprehension, Rik. `(list(g) for k,g in groupby((line.strip() for line in open('myfile.txt')), bool) if k)` would be the generator equivalent. – JAB Aug 08 '12 at 11:25
2

If the file contents fit into memory, there is a much easier way to get the groups separated by blank lines:

with open("filename") as f:
    groups = [group.split() for group in f.read().split("\n\n")]

This approach can be made more robust by using re.split() instead of str.split() and by filtering out potential empty groups resulting from four or more consecutive line breaks.
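One possible shape for the more robust variant is sketched below. The regular expression and the sample `text` are assumptions chosen for illustration, not taken from the original answer. The pattern treats any run of blank lines, including lines that contain only whitespace, as a single separator, and the filter drops empty chunks left at the start or end.

```python
import re

# Hypothetical file contents with a longer gap before the last paragraph.
text = "11\n2\n3\n4\n\n11\n\n\n\n111\n"

# A line break, optional whitespace (which may include more line breaks),
# then another line break: one separator per run of blank lines.
chunks = re.split(r"\n\s*\n", text)

# str.split() with no arguments splits on any whitespace, as in the snippet
# above, so this assumes each line holds a single token.
groups = [chunk.split() for chunk in chunks if chunk.strip()]
```

For lines that may contain spaces, `chunk.splitlines()` would probably be a better fit than `chunk.split()`.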

Sven Marnach
1

This is the documented behavior of takewhile. It takes while the condition is true. It doesn't start up again if the condition later becomes true again.

The simple fix is to make your function just call takewhile in a loop, stopping when takewhile has nothing more to return (i.e., at the end of the file):

from itertools import takewhile

def readParag(fileObj):
    while True:
        # Collect lines up to (and consuming) the next blank line.
        nextList = [ln.rstrip() for ln in takewhile(lambda line: line != "\n", fileObj)]
        if not nextList:
            break
        yield nextList
BrenBarn
0

You can call takewhile multiple times:

>>> from itertools import takewhile
>>> from StringIO import StringIO  # Python 2.7; F below is assumed to hold the sample file contents as a string
>>> def readParagGenerator(fileObj):
...     group = [ln.rstrip() for ln in takewhile(lambda line: line != "\n", fileObj)]
...     while len(group) > 0:
...         yield group
...         group = [ln.rstrip() for ln in takewhile(lambda line: line != "\n", fileObj)]
... 
>>> list(readParagGenerator(StringIO(F)))
[['11', '2', '3', '4'], ['11'], ['111']]
jterrace