Iterator using itertools is skipping a line

Question

I suspect my question is related to "Why does takewhile() skip the first line?", but I did not find a satisfactory answer there.

My examples below use the following modules:

import csv
from itertools import takewhile

Here is my problem: I have a CSV file that I want to parse using itertools.

For instance, I want to separate the header from the content. The boundary is marked by a keyword in the first column.

Here is an example file.csv:

a, content
b, content
KEYWORD, something else
c, let's continue

The first two lines make up the header of the file. The KEYWORD line separates it from the content, which is the last line.

Even though it is not strictly part of the content, I want to parse the separator row as well.

with open('file.csv', 'rb') as f:
    reader = csv.reader(f)
    header = takewhile(lambda x: x[0] != 'KEYWORD', reader)
    for row in header:
        print(row)
    print('End of header')
    for row in reader:
        print(row)

I was not expecting this, but the KEYWORD line is skipped, as the following output shows:

['a', ' content']
['b', ' content']
End of header
['c', " let's continue"]

I tried simulating the csv reader with a plain list to check whether the reader was the cause. Apparently it is not: the following code produces the same behavior.

l = [['a', 'content'],
    ['b','content'],
    ['KEYWORD', 'something else'],
    ['c', "let's continue"]]

i = iter(l)
header = takewhile(lambda x: x[0] != 'KEYWORD', i)
for row in header:
    print(row)
print('End of header')
for row in i:
    print(row)

How can I use takewhile while preventing the following for loop from skipping the row that takewhile rejected?

As I understand it, the first for loop calls next on the iterator to test the row, and the second for loop calls next again to fetch a value, so the separator row is skipped.

jonrsharpe · Answer 1 · 2014-04-29T09:50:02.620

2

I think you will have to restructure - takewhile isn't a good fit for what you are doing. The problem is that takewhile has to read the line starting 'KEYWORD' to determine that it has reached a line it shouldn't take, and once the line is read the file's "read head" is at the start of the next line. Similarly, with iter, takewhile has already consumed (but discarded) the line starting 'KEYWORD' when you start for row in i.
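A minimal sketch of the consumption described above, using a plain list iterator in place of the file. The names `rows` and `it` are illustrative only and do not come from the question.

```python
from itertools import takewhile

rows = ['a', 'b', 'KEYWORD', 'c']
it = iter(rows)

# takewhile has to pull 'KEYWORD' out of `it` in order to test it.
# The predicate fails, so the item is dropped instead of being yielded.
header = list(takewhile(lambda x: x != 'KEYWORD', it))

# `it` has already advanced past 'KEYWORD' at this point.
remaining = list(it)

print(header)     # ['a', 'b']
print(remaining)  # ['c']
```

The rejected item appears in neither list, which is exactly the "skipped line" from the question.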

One alternative would be something like:

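header = []
content = []
target = header  # rows go to the header until the separator is seen
for row in reader:
    # csv rows are lists, so test the first field rather than a string prefix
    if row[0] == 'KEYWORD':
        target = content  # the separator row itself lands in content
    target.append(row)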
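For reference, here is a self-contained Python 3 sketch of the same single-pass idea. It assumes the sample data from the question, inlined through `io.StringIO` so that no file is needed, and it compares the first field of each row against the keyword. The name `sample` is an assumption made for this illustration.

```python
import csv
import io

# Sample data from the question, inlined so the sketch runs on its own.
sample = io.StringIO(
    "a, content\n"
    "b, content\n"
    "KEYWORD, something else\n"
    "c, let's continue\n"
)

header = []
content = []
target = header
for row in csv.reader(sample):
    # The separator row switches the destination and is itself kept in content.
    if row[0] == 'KEYWORD':
        target = content
    target.append(row)

print(header)   # [['a', ' content'], ['b', ' content']]
print(content)  # [['KEYWORD', ' something else'], ['c', " let's continue"]]
```

Because each row is read exactly once and routed explicitly, nothing is consumed and thrown away.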

edited Apr 29 '14 at 09:50

answered Apr 29 '14 at 09:23

Thank you for your answer, but my processing is more involved than appending rows to a list, so it does not fully solve my problem. It did lead me to rethink a workaround, though (see my answer). Thank you. – carrieje Apr 29 '14 at 10:10

carrieje · Accepted Answer · 2014-04-29T12:48:07.247

Thanks to @jonrsharpe, I reconsidered how to work around this. Here is what I came up with:

class RewindableFile(file):
    def __init__(self, *args, **kwargs):
        nb_backup = kwargs.pop('nb_backup', 1)
        super(RewindableFile, self).__init__(*args, **kwargs)
        self._nb_backup = nb_backup
        self._backups = []
        self._time_anchor = 0

    def next(self):
        if self._time_anchor >= 0:
            item = super(RewindableFile, self).next()
            self._backup(item)
            return item
        else:
            item = self._forward()
            return item

    def rewind(self):
        self._time_anchor = self._time_anchor - 1
        time_bound = min(self._nb_backup, len(self._backups))
        if self._time_anchor < -time_bound:
            raise Exception('You have gone too far in history...')

    def __iter__(self):
        return self

    def _backup(self, row):
        self._backups.append(row)
        extra_items = len(self._backups) - self._nb_backup
        if extra_items > 0:
            del self._backups[0:extra_items]

    def _forward(self):
        item = self._backups[self._time_anchor]
        self._time_anchor = self._time_anchor + 1
        return item

And here is how I use it:

with RewindableFile('csv.csv', 'rb') as f:
    def test_kwd_and_rewind(x):
        if x[0] != 'KEYWORD':
            return True
        else:
            f.rewind()
            return False

    reader = csv.reader(f)
    header = takewhile(test_kwd_and_rewind, reader)
    for row in header:
        print(row)
    print('End of header')
    for row in reader:
        print(row)

I could also have overridden read and readline so that they take the rewind into account, but I don't need them here.

Inspired by http://stackoverflow.com/questions/3539107/python-rewinding-one-line-in-file-when-iterating-with-f-next. To preserve the buffering that `file` performs natively, I switched from a `seek`-based approach to a backup-based one and altered the code accordingly. – carrieje, Apr 29 '14 at 12:49
On reflection, the solution I linked above is better: it decorates the file instead of redefining it, so I recommend that approach, possibly extended with some of my features such as the adjustable backup capacity. https://en.wikipedia.org/wiki/Decorator_pattern – carrieje, Apr 29 '14 at 13:16
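As a rough illustration of wrapping rather than subclassing, here is a Python 3 sketch (the `file` built-in used by RewindableFile exists only in Python 2). The class name `Rewindable` and its details are assumptions made for this example. It is not the code from the linked question. It wraps any iterator and keeps the `nb_backup` idea from the answer.

```python
from itertools import takewhile


class Rewindable:
    """Wrap any iterator and allow stepping back over recently read items."""

    def __init__(self, iterable, nb_backup=1):
        self._it = iter(iterable)
        self._nb_backup = nb_backup
        self._backups = []  # most recent items, oldest first
        self._offset = 0    # number of items we are currently rewound by

    def __iter__(self):
        return self

    def __next__(self):
        if self._offset > 0:
            # Replay a buffered item instead of reading a new one.
            item = self._backups[-self._offset]
            self._offset -= 1
            return item
        item = next(self._it)
        self._backups.append(item)
        # Keep at most nb_backup items of history.
        if len(self._backups) > self._nb_backup:
            del self._backups[0]
        return item

    def rewind(self):
        if self._offset >= len(self._backups):
            raise ValueError('cannot rewind past the buffered history')
        self._offset += 1


rows = Rewindable([
    ['a', 'content'],
    ['b', 'content'],
    ['KEYWORD', 'something else'],
    ['c', "let's continue"],
])


def is_header(row):
    if row[0] != 'KEYWORD':
        return True
    rows.rewind()  # put the separator row back before takewhile stops
    return False


header = list(takewhile(is_header, rows))
content = list(rows)

print(header)   # [['a', 'content'], ['b', 'content']]
print(content)  # [['KEYWORD', 'something else'], ['c', "let's continue"]]
```

Wrapping the row iterator (here a list, but a `csv.reader` would work the same way) avoids depending on how the underlying file buffers its reads.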

Kei Minagawa · Answer 3 · 2014-04-29T12:02:27.693

0

You can write your own takewhile that also yields the first item failing the predicate, like this:

def takewhile(predicate, iterable):
    for x in iterable:
        yield x
        if not predicate(x):
            break

test:

>>> list(takewhile(lambda x:x!=3, range(10)))
[0, 1, 2, 3]
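Applied to the list from the question, the same generator behaves as sketched below. It is renamed `takewhile_inclusive` here (a hypothetical name) so that it does not shadow `itertools.takewhile`.

```python
def takewhile_inclusive(predicate, iterable):
    # Yield each item first, then stop once the predicate fails,
    # so the failing item is delivered instead of being dropped.
    for x in iterable:
        yield x
        if not predicate(x):
            break


rows = iter([
    ['a', 'content'],
    ['b', 'content'],
    ['KEYWORD', 'something else'],
    ['c', "let's continue"],
])

header = list(takewhile_inclusive(lambda r: r[0] != 'KEYWORD', rows))
content = list(rows)

print(header)   # [['a', 'content'], ['b', 'content'], ['KEYWORD', 'something else']]
print(content)  # [['c', "let's continue"]]
```

Note that the separator row is not lost, but it arrives as the last item of the header rather than the first item of the content, so the caller has to handle it on that side.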

edited Apr 29 '14 at 12:02

answered Apr 29 '14 at 10:40

That would require takewhile to access the iterable's next item without consuming it, and I can't see any way to do that. That was my first intent before creating RewindableFile. But thanks for your answer. – carrieje Apr 29 '14 at 11:48

score 0 · Answer 4 · answered Jan 27 '16 at 15:42

jonrsharpe has it right: this isn't quite a job for takewhile. itertools also has a groupby function, which can handle the splitting more easily. The LastHeader class below keeps a record of the last header line passed through its check method and returns it each time check is called. This lets you run through the file a single time, without having to backtrack.

from itertools import groupby

class LastHeader():
    """Checks for new header strings. For use with groupby"""
    def __init__(self, sentinel='#'):
        self.sentinel = sentinel
        self.lastheader = ''

    def check(self, line):
        if line.startswith(self.sentinel):
            self.lastheader = line
        return self.lastheader

with open(fname, 'r') as fobj:
    lastheader = LastHeader(sentinel)
    for headerline, readlines in groupby(fobj, lastheader.check):
        foo(headerline)
        for line in readlines:
            bar(line)

where foo and bar are whatever processing you need to do on the headers and data, and fname and sentinel stand for your file name and header prefix.
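A small runnable sketch of the groupby approach, assuming in-memory lines and a `'#'` sentinel instead of a real file. The sample `lines` and the `groups` dictionary are invented for this illustration, and the class is repeated so the block runs on its own.

```python
from itertools import groupby


class LastHeader:
    """Key function holder: remembers the most recent header line."""

    def __init__(self, sentinel='#'):
        self.sentinel = sentinel
        self.lastheader = ''

    def check(self, line):
        if line.startswith(self.sentinel):
            self.lastheader = line
        return self.lastheader


lines = ['#first', 'a', 'b', '#second', 'c']

tracker = LastHeader('#')
groups = {}
for headerline, group in groupby(lines, tracker.check):
    # Each group holds the consecutive lines that share the same header key.
    groups[headerline] = list(group)

print(groups)
# {'#first': ['#first', 'a', 'b'], '#second': ['#second', 'c']}
```

Two things are worth noticing: the header line itself is the first item of its own group, and any lines that precede the first header are grouped under the empty string.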
