Why does takewhile() skip the first line?

Question

I have a file like this:

1
2
3
TAB
1
2
3
TAB

I want to read the lines between TAB as blocks.

import itertools

def block_generator(file):
    with open(file) as lines:
        for line in lines:
            block = list(itertools.takewhile(lambda x: x.rstrip('\n') != '\t',
                                             lines))
            yield block

I want to use it as such:

blocks = block_generator(myfile)
for block in blocks:
    do_something(block)

The blocks I get all start with the second line, like [2,3] [2,3]. Why?

the for loop is eating the first line of each block – John La Rooy Sep 02 '11 at 01:51

John La Rooy · Accepted Answer · 2011-09-02T03:31:33.780

4

Here is another approach using groupby

from itertools import groupby
def block_generator(filename):
    with open(filename) as lines:
        for pred,block in groupby(lines, "\t\n".__ne__):
            if pred:
                yield block
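As a rough sketch of how this behaves on the sample data from the question, the variant below takes any iterable of lines instead of a filename (an assumption made here so that `io.StringIO` can stand in for the file) and copies each group into a list, as suggested in the comments. Note that the key `"\t\n".__ne__` only matches a tab line that ends with a newline.

```python
import io
from itertools import groupby

def block_generator(lines):
    # `lines` is any iterable of lines: an open file, or io.StringIO here.
    for is_content, block in groupby(lines, "\t\n".__ne__):
        if is_content:
            # Copy the group into a list before groupby moves past it.
            yield list(block)

# Stand-in for the file shown in the question.
sample = io.StringIO("1\n2\n3\n\t\n1\n2\n3\n\t\n")
blocks = list(block_generator(sample))
print(blocks)  # [['1\n', '2\n', '3\n'], ['1\n', '2\n', '3\n']]
```

Each block now starts with its first line, because only groupby() pulls lines from the iterator.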

edited Sep 02 '11 at 03:31

answered Sep 02 '11 at 01:57

John La Rooy


Hi @gnibbler, your code may work fine for small files. I have a very big file and I do not want to read it all at once. But thanks for your code. – gstar2002 Sep 02 '11 at 02:04
@gstar, why do you think my code reads the whole file at once? – John La Rooy Sep 02 '11 at 02:08
Why not just `for x, y in groupby(lines, "\t\n".__ne__): if x: yield list(y)`? (I was going to say "why not just return the generator expression", but I guess that results in the context manager triggering prematurely...) (I was surprised to find that `groupby` does not collate groups with the same key...) – Karl Knechtel Sep 02 '11 at 02:52
@Karl, yeah, that's nicer I think. And yes, you need to yield inside the with block to prevent the file being closed prematurely – John La Rooy Sep 02 '11 at 03:33
@gnibbler, a real groupby() should be a global operation on the whole list; it must somehow first read all the lines in. But you are right here, groupby() in Python is a change detector. – gstar2002 Sep 05 '11 at 18:37
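To illustrate the two points made in these comments, here is a small sketch (the `tracked` helper and the sample keys are made up for illustration). It shows that itertools.groupby() starts a new group whenever the key changes, so equal keys that are not adjacent end up in separate groups, and that it pulls items from its input only as they are needed.

```python
from itertools import groupby

keys = [1, 1, 2, 2, 1]

# Runs of equal keys become groups; the trailing 1 is NOT merged
# with the leading 1s.
runs = [(key, list(group)) for key, group in groupby(keys)]
print(runs)  # [(1, [1, 1]), (2, [2, 2]), (1, [1])]

consumed = []

def tracked(items):
    # Hypothetical helper: records every item that is pulled from it.
    for item in items:
        consumed.append(item)
        yield item

groups = groupby(tracked(keys))
first_key, first_group = next(groups)
# At this point groupby has pulled only what it needed to find the
# first key, not the whole input.
pulled_after_first_key = len(consumed)
```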

steveha · Answer 2 · 2011-09-02T06:23:47.563

Here you go, tested code. Uses while True: to loop, and lets itertools.takewhile() do everything with lines. When itertools.takewhile() reaches the end of input, it returns an iterator that does nothing except raise StopIteration, which list() simply turns into an empty list, so a simple if not block: test detects the empty list and breaks out of the loop.

import itertools

def not_tabline(line):
    return '\t' != line.rstrip('\n')

def block_generator(file):
    with open(file) as lines:
        while True:
            block = list(itertools.takewhile(not_tabline, lines))
            if not block:
                break
            yield block

for block in block_generator("test.txt"):
    print("BLOCK:")
    print(block)

As noted in a comment below, this has one flaw: if the input text has two lines in a row with just the tab character, this loop will stop processing without reading all the input text. And I cannot think of any way to handle this cleanly; it's really unfortunate that the iterator you get back from itertools.takewhile() uses StopIteration both as the marker for the end of a group and as what you get at end-of-file. To make it worse, I cannot find any way to ask a file iterator object whether it has reached end-of-file or not. And to make it even worse, itertools.takewhile() seems to advance the file iterator to end-of-file instantly; when I tried to rewrite the above to check on our progress using lines.tell() it was already at end-of-file after the first group.

I suggest using the itertools.groupby() solution. It's cleaner.
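The flaw with two consecutive tab lines can be reproduced with a short sketch. Here `io.StringIO` stands in for the file, and `blocks_accumulate` is a hypothetical plain-loop alternative that does not use takewhile() at all. It is shown only for comparison and is not part of the original answer.

```python
import io
import itertools

def not_tabline(line):
    return line.rstrip('\n') != '\t'

def blocks_takewhile(lines):
    # Same loop as above, but over an already-open iterable of lines.
    while True:
        block = list(itertools.takewhile(not_tabline, lines))
        if not block:
            break  # cannot tell "empty block" from "end of input"
        yield block

def blocks_accumulate(lines):
    # Hypothetical alternative: collect lines by hand, skip empty blocks.
    block = []
    for line in lines:
        if not_tabline(line):
            block.append(line)
        elif block:
            yield block
            block = []
    if block:
        yield block  # trailing block with no closing tab line

text = "1\n2\n\t\n\t\n3\n\t\n"  # two tab lines in a row
stopped_early = list(blocks_takewhile(io.StringIO(text)))
all_blocks = list(blocks_accumulate(io.StringIO(text)))
print(stopped_early)  # [['1\n', '2\n']]
print(all_blocks)     # [['1\n', '2\n'], ['3\n']]
```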

Great, I should try using your code. Thanks. I do not know if regex can also do the job. – gstar2002 Sep 02 '11 at 02:08
@Paul McGuire, that is a truly excellent point. I think the `itertools.groupby()` answer is cleaner and doesn't have this flaw. — steveha, Sep 02 '11 at 06:05

score 1 · Answer 3 · answered Sep 02 '11 at 01:22

I think the problem is that you are taking lines in your lambda function rather than line. What is your expected output?

answered Sep 02 '11 at 01:22

Benjamin


score 1 · Answer 4 · answered Sep 02 '11 at 01:32

1

itertools.takewhile implicitly iterates over the lines of the file in order to grab chunks, but so does for line in lines:. Each time through the loop, a line is grabbed, thrown away (since there is no code that uses line), and then some more are blocked together.
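A minimal sketch of this effect, using `io.StringIO` in place of the file from the question (the `is_content` name and the `eaten` list are introduced here only for illustration):

```python
import io
import itertools

def is_content(line):
    return line.rstrip('\n') != '\t'

lines = io.StringIO("1\n2\n3\n\t\n1\n2\n3\n\t\n")

eaten = []   # lines taken by the for loop and never used
blocks = []  # what takewhile() manages to collect afterwards
for line in lines:
    eaten.append(line)
    # takewhile() pulls from the same iterator, starting after `line`.
    blocks.append(list(itertools.takewhile(is_content, lines)))

print(eaten)   # ['1\n', '1\n']
print(blocks)  # [['2\n', '3\n'], ['2\n', '3\n']]
```

The for statement and takewhile() share a single iterator, so every line goes to exactly one of them. The tab line that ends each block is also consumed by takewhile(), which is why the for loop then receives the first line of the next block.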

answered Sep 02 '11 at 01:32

Karl Knechtel


Hi Karl, I have thought about that. After the first takewhile() the file pointer points at the TAB line; after I process the first block, "for" moves the file pointer to the next line, '1', and gives it to takewhile(). It should be right, but... – gstar2002 Sep 02 '11 at 01:43
The for loop does not "move a file pointer"; that is the wrong way to think about it. It iterates over lines of the file. The first time through the loop, `line` is equal to `'1\n'`. That value has been consumed, and is no longer available for `takewhile()`. – Karl Knechtel Sep 02 '11 at 01:44
OK, I see it. So `takewhile()` consumed the TAB line. Then `for` consumed the `'1\n'` line, so `takewhile()` gets lines from `'2\n'`. Great. – gstar2002 Sep 02 '11 at 01:59
