Regular expression: Split text by custom line terminator

Question

I'm trying to build a regular expression to split a very long text string into components. Here's an example:

12345 2018-01-03 15:24:12 6789 STARTlorem ipsum dolor sit amet**STARTconsectetur adipiscing elit**

So far, I have something like this:

import re

example = r"12345 2018-01-03 6789 STARTlorem ipsum dolor sit amet**STARTconsectetur adipiscing elit**"
# Raw strings keep backslashes such as \d from being read as string escapes
str_regex = r"(?P<id1>\d+) (?P<adate>\d{4}-\d{2}-\d{2}) (?P<id2>\d+) " + \
    r"(?P<firstpart>START.*\*{2})"
m = re.search(str_regex, example)

m.group('id1')
## '12345'
# This is OK

m.group('adate')
## '2018-01-03'
# This is OK

m.group('id2')
## '6789'
# This is OK

m.group('firstpart')
## 'STARTlorem ipsum dolor sit amet**STARTconsectetur adipiscing elit**'
# This is where I'm lost: I'd like to match the text until the first occurrence of '**'

When I try to add a second part to the regular expression, it stops working:

str_regex = "(?P<id1>\d+) (?P<adate>\d{4}-\d{2}-\d{2}) (?P<id2>\d+) " + \
    "(?P<firstpart>START.*\*{2})" + \
    "(?P<secondpart>START.*\*{2}"
m = re.search(str_regex, example) # This doesn't match anything anymore!

What I'd like to get is a regex that splits the example string as follows:

str_regex = "..." # What to put here to split the string?
m = re.search(str_regex, example)

m.group('id1')
## '12345'

m.group('adate')
## '2018-01-03'

m.group('id2')
## '6789'

m.group('firstpart')
## 'STARTlorem ipsum dolor sit amet**'

m.group('secondpart')
## 'STARTconsectetur adipiscing elit**'

Notes:

It is guaranteed that there will be no other occurrences of a double star (**) anywhere inside the strings.
There will always be a header (containing id1, date and id2) and two content parts (firstpart and secondpart). Anything else can (and will) be ignored.

Can you point me in the right direction?

It looks like you also have an optional time part, don't you? Check [this regex demo](https://regex101.com/r/Sua15u/3) — Wiktor Stribiżew, Apr 15 '19 at 15:17
@WiktorStribiżew Although it's not included in the example, yes, I have an optional time part... but I've successfully dealt with that. My real problem is with the "contents" parts — Barranka, Apr 15 '19 at 15:22
I think the main thing you want to look at is using the [lazy wildcard](https://www.rexegg.com/regex-quantifiers.html#lazy) — Tyler Marshall, Apr 15 '19 at 15:23
Then your only issue is `.*`. You must use `.*?`. The `.*` matches any 0+ chars as many as possible, and `*?` will make it match as few as possible. — Wiktor Stribiżew, Apr 15 '19 at 15:23
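
To illustrate the lazy-quantifier suggestion from the comments, here is a minimal sketch rather than a definitive answer. It assumes the optional time part looks like HH:MM:SS (the group name `atime` is made up for this example) and that the content parts contain no newlines; if they can, passing `re.DOTALL` would probably be needed so that `.` also matches line breaks.

```python
import re

# Example string from the question (without the optional time part)
example = (
    "12345 2018-01-03 6789 "
    "STARTlorem ipsum dolor sit amet**STARTconsectetur adipiscing elit**"
)

pattern = re.compile(
    r"(?P<id1>\d+) "
    r"(?P<adate>\d{4}-\d{2}-\d{2}) "
    r"(?:(?P<atime>\d{2}:\d{2}:\d{2}) )?"  # assumption: optional HH:MM:SS time
    r"(?P<id2>\d+) "
    r"(?P<firstpart>START.*?\*{2})"        # .*? stops at the first '**'
    r"(?P<secondpart>START.*?\*{2})"       # same for the second part
)

m = pattern.search(example)
print(m.group("firstpart"))   # STARTlorem ipsum dolor sit amet**
print(m.group("secondpart"))  # STARTconsectetur adipiscing elit**

# The same pattern on the variant that includes a time part
with_time = example.replace("2018-01-03 ", "2018-01-03 15:24:12 ")
m2 = pattern.search(with_time)
print(m2.group("atime"), m2.group("id2"))  # 15:24:12 6789
```

One more thing worth checking: the second attempt in the question is missing the closing parenthesis of the `secondpart` group, so `re.search` would raise `re.error` there instead of returning a match. With the parenthesis added, even the greedy `.*` version should split this particular example correctly through backtracking, but `.*?` states the intent ("stop at the first `**`") more directly.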
