I'm trying to build a regular expression to split a very long text string into components. Here's an example:
12345 2018-01-03 15:24:12 6789 STARTlorem ipsum dolor sit amet**STARTconsectetur adipiscing elit**
So far, I have something like this:
import re
example = r"12345 2018-01-03 6789 STARTlorem ipsum dolor sit amet**STARTconsectetur adipiscing elit**"
str_regex = r"(?P<id1>\d+) (?P<adate>\d{4}-\d{2}-\d{2}) (?P<id2>\d+) " + \
r"(?P<firstpart>START.*\*{2})"
m = re.search(str_regex, example)
m.group('id1')
## '12345'
# This is OK
m.group('adate')
## '2018-01-03'
# This is OK
m.group('id2')
## '6789'
# This is OK
m.group('firstpart')
## 'STARTlorem ipsum dolor sit amet**STARTconsectetur adipiscing elit**'
# This is where I'm lost: I'd like to match the text until the first occurrence of '**'
When I try to add a second part to the regular expression, it stops working:
str_regex = r"(?P<id1>\d+) (?P<adate>\d{4}-\d{2}-\d{2}) (?P<id2>\d+) " + \
r"(?P<firstpart>START.*\*{2})" + \
r"(?P<secondpart>START.*\*{2}"
m = re.search(str_regex, example) # This doesn't match anything anymore!
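One thing worth checking here, offered as a sketch rather than a diagnosis: as typed, the secondpart group has no closing parenthesis. In that case re.compile would be expected to raise re.error rather than return no match. The snippet below assumes the pattern exactly as written above, using raw strings for the pattern; the names header, broken, balanced and compiled_ok are made up for illustration.

```python
import re

example = "12345 2018-01-03 6789 STARTlorem ipsum dolor sit amet**STARTconsectetur adipiscing elit**"

# Header part shared by both patterns (hypothetical helper name).
header = r"(?P<id1>\d+) (?P<adate>\d{4}-\d{2}-\d{2}) (?P<id2>\d+) "

# Pattern as typed in the question: the last group is never closed.
broken = header + r"(?P<firstpart>START.*\*{2})" + r"(?P<secondpart>START.*\*{2}"

try:
    re.compile(broken)
    compiled_ok = True
except re.error as exc:
    # An unbalanced parenthesis is a compile-time error, not a failed match.
    compiled_ok = False
    print("pattern rejected:", exc)

# Same pattern with the closing parenthesis added.
balanced = broken + ")"
m = re.search(balanced, example)
print(m.group('firstpart'))
print(m.group('secondpart'))
```

With the parenthesis balanced, backtracking lets the greedy .* in firstpart give characters back until secondpart can match, so on this particular string both groups come out separately.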
What I'd like to get is a regex that splits the example string as follows:
str_regex = "..." # What to put here to split the string?
m = re.search(str_regex, example)
m.group('id1')
## '12345'
m.group('adate')
## '2018-01-03'
m.group('id2')
## '6789'
m.group('firstpart')
## 'STARTlorem ipsum dolor sit amet**'
m.group('secondpart')
## 'STARTconsectetur adipiscing elit**'
Notes:
- It is guaranteed that there will be no other occurrences of the double star ('**') anywhere inside the strings.
- There will always be a header (containing id1, date and id2) and two content parts (firstpart and secondpart). Anything else can (and will) be ignored.
Can you point me in the right direction?
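A minimal sketch of one possible direction, assuming the header and both content parts always appear in the order described in the notes: replacing the greedy .* with the lazy .*? makes each part stop at the first '**' that follows it. It uses the example string from the code above, which has no time-of-day field, and the names pattern and parts are introduced here for illustration only.

```python
import re

example = "12345 2018-01-03 6789 STARTlorem ipsum dolor sit amet**STARTconsectetur adipiscing elit**"

pattern = re.compile(
    r"(?P<id1>\d+) (?P<adate>\d{4}-\d{2}-\d{2}) (?P<id2>\d+) "
    r"(?P<firstpart>START.*?\*{2})"    # lazy: stops at the first '**'
    r"(?P<secondpart>START.*?\*{2})"   # lazy: stops at the next '**'
)

m = pattern.search(example)
parts = m.groupdict()   # maps each group name to its matched text
for name, value in parts.items():
    print(name, repr(value))
```

If the real text can contain line breaks inside the content parts, passing re.DOTALL to re.compile may be needed so that the dot also matches newlines; that is an assumption about the data, not something the example shows.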