Finding block of lines starting with a specific character

Question

(I edited the question for clarification)

I would appreciate suggestions on how to implement the following in Python. Given the text

> first
> second
third
fourth
> fifth
> sixth
> seventh

I would like to get two subtexts:

first
second

and

fifth
sixth
seventh

That is, given some lines of text as input, the output should be the blocks of lines which start with >. A "block" in my definition here is a run of consecutive lines that all start with >. In the example above, the third line doesn't start with >, so it ends the first block, which consists of the first two lines. The second block then starts at the next line which starts with >, i.e. the fifth line.

"each quote starts with `>` (similar to stackoverflow's four spaces)"—Stack Overflow uses `>` for blockquotes, just like all Markdown tools I've ever used. It's part of the standard, the original reference implementation, and has been used for quoting in email since time began. Four spaces in Markdown represent a _code block_. — ChrisGPT was on strike, Feb 01 '22 at 15:55
I'm not clear what you're asking. You've stated a problem, outlined your planned solution, and told us where you want to start. Do you have a question? Are you asking us to write a regular expression that matches leading `>` characters? You've tagged this with [tag:python] and [tag:regex]. Have you tried to parse the input? Have you considered using a proper Markdown library instead of cobbling together regular expressions? Please read [ask]. — ChrisGPT was on strike, Feb 01 '22 at 15:57
...actually, I re-wrote the question to be more short and to the point. — pelegs, Feb 01 '22 at 20:03
What does this have to do with converting between quote styles? It looks like you removed that aspect of the question in the last edit. — wjandrea, Feb 01 '22 at 20:18
I did, because the entire thing about why I want to do it is irrelevant and makes the question harder to follow. I have a work plan to solve the big problem I face, I just need to find how to implement the tiny "blocks" issue presented in the edited question. — pelegs, Feb 01 '22 at 20:41
@pelegs I mean the title still says "Converting between quote markdown styles" — wjandrea, Feb 01 '22 at 21:22

score 0 · Answer 1 · answered Feb 02 '22 at 13:20

I decided to use a brute-force approach to solving the issue. It's not elegant but it works (the code using consecutive_groups was taken from an answer to this question):

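from more_itertools import consecutive_groups

def get_block_ids(s, sep='>'):
    # s is a sequence of lines, e.g. text.splitlines()
    # Indices of all lines that start with the marker
    idx = [i for i, line in enumerate(s) if line.startswith(sep)]
    # Split the indices into runs of consecutive integers, one run per block
    idx_grouped = [list(group) for group in consecutive_groups(idx)]
    # Keep only the first and last index of each run
    idx_ranges = [(g[0], g[-1]) for g in idx_grouped]
    return idx_ranges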

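The function get_block_ids returns a list of tuples, each one containing the indices of the first and last line of the respective block. Note that s must be a sequence of lines (for example text.splitlines()), not a single string; if a plain string is passed, enumerate iterates over its characters instead of its lines.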
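
If the goal is the block contents rather than the index ranges, one possible alternative that needs only the standard library is itertools.groupby, which clusters consecutive lines by whether they start with the marker. This is a rough sketch and not the code from the answer above: get_blocks is a hypothetical name, and it assumes the marker and the whitespace around the remaining text should be stripped.

```python
from itertools import groupby

def get_blocks(text, sep=">"):
    """Return a list of blocks; each block is a list of lines without the marker."""
    blocks = []
    # groupby yields runs of consecutive lines that share the same key,
    # here: whether the line starts with the marker or not
    for is_quoted, group in groupby(text.splitlines(), key=lambda line: line.startswith(sep)):
        if is_quoted:
            # Assumption: drop the marker and any surrounding whitespace
            blocks.append([line[len(sep):].strip() for line in group])
    return blocks

text = "> first\n> second\nthird\nfourth\n> fifth\n> sixth\n> seventh"
print(get_blocks(text))
# [['first', 'second'], ['fifth', 'sixth', 'seventh']]
```

Joining each inner list with "\n" gives the two subtexts from the question.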
