Can I use itertools.groupby to return groups of lines where the first line starts with a specific character?

Question

I have a text file that looks like this:

>Start of group

text1

text2

>Start of new group

text3

I've been trying to use itertools.groupby to return a list of groups, where each group is a list containing:

1) line starting with the ">" character.

2) the lines of text following the line starting with the ">" character, up to the next line starting with the ">" character.

So from the text above, I want to get:

[['>Start of group', 'text1', 'text2'], ['>Start of new group', 'text3']]

The code I have written so far is:

from itertools import groupby

with open(filename) as rfile:
    groups = []

    for key, group in groupby(rfile, lambda x: x.startswith(">")):
        groups.append(list(group))

However, this produces a list of lists where every line of the file is in its own list, like this:

[['>Start of group'], ['text1'], ['text2'], ['>Start of new group'], ['text3']]

I think I probably just don't understand the groupby function very well, since this is the first time I'm trying to implement it, so any explanation is appreciated.

`itertools.groupby` groups items that share a common characteristic, e.g. group all uppercase letters, or group all words that start with "foo". It is harder to use here, since you really just want to split the lines wherever some condition holds. See here on [when to use `groupby`](https://stackoverflow.com/a/45873519/4531270) – pylang, May 20 '19 at 17:33

score 2 · Accepted Answer · answered May 19 '19 at 18:49

Here is a way to get your data without the groupby function.

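data = []

# The with block closes the file once the loop finishes.
with open('fasta.out') as fin:
    for line in fin:
        line = line.rstrip()

        if line.startswith('>'):
            # A header line starts a new group.
            data.append([line])
        else:
            # Any other line joins the most recent group
            # (assumes the file begins with a '>' line).
            data[-1].append(line)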

Chris Charley


A bit cleaner: `if line.startswith('>'): data.append([]); data[-1].append(line)`. – chepner May 19 '19 at 22:17
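Written out over several lines, the variant from the comment above might look like the following sketch. The function name `group_records` and the in-memory sample lines are assumptions made for illustration; they are not part of the original answer.

```python
def group_records(lines):
    """Collect lines into groups, starting a new group at each '>' line."""
    data = []
    for line in lines:
        line = line.rstrip()
        if line.startswith('>'):
            data.append([])      # open a new, empty group
        data[-1].append(line)    # header and body lines are appended the same way
    return data


sample = ['>Start of group\n', 'text1\n', 'text2\n',
          '>Start of new group\n', 'text3\n']
print(group_records(sample))
# [['>Start of group', 'text1', 'text2'], ['>Start of new group', 'text3']]
```

Like the original loop, this raises an IndexError if the input does not begin with a '>' line.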

score 0 · Answer 2 · answered May 19 '19 at 21:57

groupby groups consecutive items in an iterable by a key function that is applied to each element. That means the key function must be able to identify the group an element belongs to by looking at that element alone. Since your data doesn't allow that (you must look at preceding elements to determine the grouping key), this is not a good candidate for groupby, and Chris Charley's answer is a cleaner solution.
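A small sketch of what that means in practice, using an in-memory list in place of the file (the `lines` list is an assumption for illustration): with a key that only tests for ">", groupby can tell header lines from other lines, but it cannot tell which header a text line belongs to.

```python
from itertools import groupby

lines = ['>Start of group', 'text1', 'text2', '>Start of new group', 'text3']

# groupby starts a new group each time the key value changes,
# so headers (True) and text lines (False) land in separate runs.
runs = [(key, list(group))
        for key, group in groupby(lines, lambda x: x.startswith('>'))]
print(runs)
# [(True, ['>Start of group']), (False, ['text1', 'text2']),
#  (True, ['>Start of new group']), (False, ['text3'])]
```

Each header ends up separated from the lines that follow it, which is why a per-line test alone is not enough here.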

That said, if you are looking at this as a coding challenge rather than a real-world problem, you could create a grouping function that keeps state and tracks the last group label seen. A class that implements __call__, stores the last group label seen as an attribute, and returns that label whenever the next input is not a group label could achieve what you are looking for.
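One possible sketch of such a stateful key, under the assumptions that the class is called `LastLabel` (a made-up name) and that the lines are already stripped of newlines:

```python
from itertools import groupby


class LastLabel:
    """Key callable that remembers the most recent '>' line (hypothetical helper)."""

    def __init__(self):
        self.label = None

    def __call__(self, line):
        if line.startswith('>'):
            self.label = line   # a new group label: remember it
        return self.label       # every line is keyed by the last label seen


lines = ['>Start of group', 'text1', 'text2', '>Start of new group', 'text3']
groups = [list(group) for _, group in groupby(lines, LastLabel())]
print(groups)
# [['>Start of group', 'text1', 'text2'], ['>Start of new group', 'text3']]
```

One caveat with keying on the label text: two adjacent groups whose header lines are identical would be merged into one. Keying on a running counter instead avoids that.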

score 0 · Answer 3 · answered May 19 '19 at 22:31

The key is to tag each line in the same group with the same number, which can be done with another generator. Consider this a demonstration of how groupby works, rather than a practical suggestion; use Chris Charley's answer instead.

def number_lines(text):
    i = 0
    for line in text:
        # Each line starting with ">" begins a new group number.
        if line.startswith(">"):
            i += 1
        yield (i, line)

Note the sequence of tuples produced by number_lines is automatically sorted by the first element of the tuple. In order to group them, tell groupby to use the first element as the "group tag".

from itertools import groupby
from operator import itemgetter

with open(filename) as rfile:
    numbered_lines = number_lines(rfile)
    # Drop the group number and keep only the lines of each group.
    groups = [[line for _, line in group]
              for _, group in groupby(numbered_lines, itemgetter(0))]
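To try this numbering approach without a file on disk, here is a self-contained sketch that substitutes `io.StringIO` for the opened file. The sample text and the `rstrip("\n")` call are assumptions added for the demonstration.

```python
import io
from itertools import groupby
from operator import itemgetter


def number_lines(lines):
    """Yield (group_number, line); the number goes up at each '>' line."""
    i = 0
    for line in lines:
        if line.startswith(">"):
            i += 1
        yield (i, line)


# StringIO stands in for the file object and yields lines with trailing newlines.
rfile = io.StringIO(">Start of group\ntext1\ntext2\n>Start of new group\ntext3\n")

groups = [[line.rstrip("\n") for _, line in group]
          for _, group in groupby(number_lines(rfile), itemgetter(0))]
print(groups)
# [['>Start of group', 'text1', 'text2'], ['>Start of new group', 'text3']]
```

Any lines that appear before the first ">" line are numbered 0 and come out as their own leading group.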
