Split text into sections using python regex

Question

I have a large, multi-line string with multiple entries following a similar format. I'd like to split it into a list of strings for each entry.

I tried the following:

myre = re.compile(r'Record\sTime.*-{5}', re.DOTALL)
return re.findall(myre, text)

In this case, entries start with 'Record Time' and end with '-----'. Instead of returning one item per entry, the code above returns a single item that starts at the beginning of the first entry and ends at the end of the last one.

I could probably make this work by using a regex to find the end of a segment, then repeating the search on a slice of the original text starting there, but that seems messy.

score 5 · Accepted Answer · edited May 23 '17 at 11:49

You need to turn the .* into a reluctant match, by adding a question mark:

.*?

Otherwise .* is greedy: it consumes as much as it can, from just after the first 'Record Time' to just before the last '-----', so the whole text comes back as a single match.

See Greedy vs. Reluctant vs. Possessive Quantifiers
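To make the difference concrete, here is a minimal sketch comparing the two quantifiers. The sample text and the variable names are made up for illustration; only the two patterns come from the question and this answer.

```python
import re

# Hypothetical sample text with two records, for illustration only.
text = "Record Time\n1\n2\n-----\nRecord Time\n3\n-----\n"

# Greedy: .* runs to the last '-----' in the whole text.
greedy = re.compile(r'Record\sTime.*-{5}', re.DOTALL)
# Reluctant: .*? stops at the first '-----' after each 'Record Time'.
reluctant = re.compile(r'Record\sTime.*?-{5}', re.DOTALL)

greedy_matches = greedy.findall(text)
reluctant_matches = reluctant.findall(text)

print(len(greedy_matches))     # 1
print(len(reluctant_matches))  # 2
```

With the reluctant version, each list item is one entry, from 'Record Time' through its closing '-----'.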

answered Jan 11 '14 at 17:40

NPE

dawg · Answer 2 · 2014-01-11T18:05:52.870

Something like this:

txt='''\
Record Time
1
2
3
-----

Record Time
4
5
-----
Record Time
6
7
8
'''

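import re
# re.M lets ^ and $ match at line boundaries; re.S lets . match newlines.
# A block ends at a line starting with '-----' or at the end of the text.
pat = re.compile(r'^Record Time$(.*?)(?:^-{5}|\Z)', re.S | re.M)
for i, m in enumerate(pat.finditer(txt)):
    print('block:', i)
    print(m.group(1).strip())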

Prints:

block: 0
1
2
3
block: 1
4
5
block: 2
6
7
8
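Since the question asks for a list of strings, the same pattern can be wrapped in a small helper. This is only a sketch: split_records and RECORD_PAT are hypothetical names, and the sample text just mirrors the one above.

```python
import re

# Same pattern as in the answer above: a record starts at a line reading
# "Record Time" and runs to a line starting with "-----" or to the end of the text.
RECORD_PAT = re.compile(r'^Record Time$(.*?)(?:^-{5}|\Z)', re.S | re.M)

def split_records(text):
    """Return the stripped body of every record as a list of strings.

    split_records is a hypothetical helper name, not part of the re module.
    """
    return [m.group(1).strip() for m in RECORD_PAT.finditer(text)]

sample = "Record Time\n1\n2\n3\n-----\n\nRecord Time\n4\n5\n-----\nRecord Time\n6\n7\n8\n"
records = split_records(sample)
print(records)  # ['1\n2\n3', '4\n5', '6\n7\n8']
```

Note that the last record has no closing '-----'; the \Z alternative is what lets it be picked up anyway.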

answered Jan 11 '14 at 17:44

dawg

score 1 · Answer 3 · answered Jan 11 '14 at 17:55

You can avoid a reluctant quantifier with a trick that emulates an atomic group: (?=(...))\1. It's not strictly on topic, but it can be useful:

myre = re.compile(r'Record\sTime(?:(?=([^-]+|-(?!-{4})))\1)+-{5}')
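Here is a rough sketch of that pattern in use. The lookahead captures one chunk (a run of non-hyphens, or a single hyphen that does not begin a run of five) without consuming it, and \1 then consumes exactly what was captured. The sample text and variable names are assumptions for illustration. Two things to watch: the pattern has to be a raw string, otherwise Python reads '\1' as a control character instead of a backreference, and findall would return the contents of group 1 rather than the whole entry, so the sketch uses finditer.

```python
import re

# Raw string so that \1 reaches the regex engine as a backreference
# rather than being read by Python as the control character '\x01'.
atomic_like = re.compile(r'Record\sTime(?:(?=([^-]+|-(?!-{4})))\1)+-{5}')

# Hypothetical sample text; the second record contains a lone hyphen.
text = "Record Time\n1\n2\n-----\nRecord Time\n3 - 4\n-----\n"

# group(0) is the whole match; group 1 only holds the last captured chunk.
entries = [m.group(0) for m in atomic_like.finditer(text)]
print(len(entries))  # 2
```

No re.DOTALL is needed here, because [^-] already matches newlines.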

Casimir et Hippolyte
