
I have a large, multi-line string with multiple entries following a similar format. I'd like to split it into a list of strings for each entry.

I tried the following:

myre = re.compile(r'Record\sTime.*-{5}', re.DOTALL)
return re.findall(myre, text)

In this case, entries start with 'Record Time' and end with '-----'. Instead of returning one item per entry, the code above returns a single item that starts at the beginning of the first entry and ends at the end of the last one.

I could probably find a way to make this work by using regex to find the end of a segment, then repeat with a slice of the original text starting there, but that seems messy.

Turtles Are Cute

3 Answers


You need to turn the .* into a reluctant match, by adding a question mark:

.*?

Otherwise .* is greedy and matches as much as it can: everything from just after the first 'Record Time' to just before the last '-----', so all the records end up in a single match.

See Greedy vs. Reluctant vs. Possessive Quantifiers
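
To make the difference concrete, here is a minimal sketch that runs both quantifiers over the same input. The sample text and the variable names are assumptions made up for illustration; only the pattern comes from the question.

```python
import re

# Hypothetical sample text: two entries in the format described in the question.
text = (
    "Record Time 10:00\n"
    "alpha\n"
    "-----\n"
    "Record Time 11:00\n"
    "beta\n"
    "-----\n"
)

# Greedy: .* runs to the last '-----', so both entries come back as one match.
greedy = re.compile(r'Record\sTime.*-{5}', re.DOTALL)
# Reluctant: .*? stops at the first '-----' after each 'Record Time'.
reluctant = re.compile(r'Record\sTime.*?-{5}', re.DOTALL)

greedy_matches = greedy.findall(text)
reluctant_matches = reluctant.findall(text)

print(len(greedy_matches))     # 1
print(len(reluctant_matches))  # 2
```

Since the pattern has no capturing group, findall returns the full text of each match, so the reluctant version should give the list of entries directly.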

NPE

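Something like this, which captures the body of each record and also handles a final record that has no closing '-----':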

txt='''\
Record Time
1
2
3
-----

Record Time
4
5
-----
Record Time
6
7
8
'''

import re
# re.S lets . cross newlines; re.M lets ^ and $ match at every line boundary.
pat = re.compile(r'^Record Time$(.*?)(?:^-{5}|\Z)', re.S | re.M)
for i, m in enumerate(pat.finditer(txt)):
    print('block:', i)
    print(m.group(1).strip())

Prints:

block: 0
1
2
3
block: 1
4
5
block: 2
6
7
8
dawg

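You can also avoid the reluctant quantifier with a trick that emulates an atomic group: (?=(...))\1. The lookahead captures text without consuming it, and the backreference then consumes exactly what was captured, so the engine cannot backtrack into it. It's slightly off-topic, but it can be useful: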

myre = re.compile(r'Record\sTime(?:(?=([^-]+|-(?!-{4})))\1)+-{5}')
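
As a rough usage sketch, the pattern can be applied with finditer. The sample text and the names atomic and entries are assumptions for illustration. Two details matter here: the pattern needs to be a raw string, since in a normal string literal \1 is the character '\x01' rather than a backreference, and findall would return only the contents of the capturing group rather than the whole entry.

```python
import re

# Hypothetical sample text, with a stray '-' inside the first entry.
text = (
    "Record Time 10:00\n"
    "a-b\n"
    "-----\n"
    "Record Time 11:00\n"
    "beta\n"
    "-----\n"
)

# Raw string so that \1 is a backreference to the lookahead's capture group.
atomic = re.compile(r'Record\sTime(?:(?=([^-]+|-(?!-{4})))\1)+-{5}')

# Take group(0), the full match, because the pattern contains a capturing group.
entries = [m.group(0) for m in atomic.finditer(text)]
print(entries)
```

A single '-' inside an entry is consumed by the -(?!-{4}) branch, so only a run of five dashes ends the match.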
Casimir et Hippolyte