I'm trying to build a regular expression to split a very long text string into components. Here's an example:

12345 2018-01-03 15:24:12 6789 STARTlorem ipsum dolor sit amet**STARTconsectetur adipiscing elit**

So far, I have something like this:

import re

example = r"12345 2018-01-03 6789 STARTlorem ipsum dolor sit amet**STARTconsectetur adipiscing elit**"
str_regex = r"(?P<id1>\d+) (?P<adate>\d{4}-\d{2}-\d{2}) (?P<id2>\d+) " + \
    r"(?P<firstpart>START.*\*{2})"
m = re.search(str_regex, example)

m.group('id1')
## '12345'
# This is OK

m.group('adate')
## '2018-01-03'
# This is OK

m.group('id2')
## '6789'
# This is OK

m.group('firstpart')
## 'STARTlorem ipsum dolor sit amet**STARTconsectetur adipiscing elit**'
# This is where I'm lost: I'd like to match the text until the first occurrence of '**'

When I try to add a second part to the regular expression, it stops working:

str_regex = r"(?P<id1>\d+) (?P<adate>\d{4}-\d{2}-\d{2}) (?P<id2>\d+) " + \
    r"(?P<firstpart>START.*\*{2})" + \
    r"(?P<secondpart>START.*\*{2}"
m = re.search(str_regex, example) # This doesn't match anything anymore!

What I'd like to get is a regex that splits the example string as follows:

str_regex = "..." # What to put here to split the string?
m = re.search(str_regex, example)

m.group('id1')
## '12345'

m.group('adate')
## '2018-01-03'

m.group('id2')
## '6789'

m.group('firstpart')
## 'STARTlorem ipsum dolor sit amet**'

m.group('secondpart')
## 'STARTconsectetur adipiscing elit**'

Notes:

  • It is guaranteed that the double star (`**`) occurs nowhere else inside the strings.
  • There will always be a header (containing id1, date and id2) and two content parts (firstpart and secondpart). Anything else can (and will) be ignored.

Can you point me in the right direction?

Barranka
  • It looks like you also have an optional time part, don't you? Check [this regex demo](https://regex101.com/r/Sua15u/3) – Wiktor Stribiżew Apr 15 '19 at 15:17
  • @WiktorStribiżew Although it's not included in the example, yes, I have an optional time part... but I've successfully dealt with that. My real problem is with the "contents" parts – Barranka Apr 15 '19 at 15:22
  • I think the main thing you want to look at is using the [lazy wildcard](https://www.rexegg.com/regex-quantifiers.html#lazy) – Tyler Marshall Apr 15 '19 at 15:23
  • Then your only issue is `.*`. You must use `.*?`. The `.*` matches zero or more characters, as many as possible, and `*?` will make it match as few as possible. – Wiktor Stribiżew Apr 15 '19 at 15:23
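
The lazy-quantifier suggestion from the comments could look roughly like the sketch below. It assumes the simplified example string from the question (no time part in the header) and reuses the question's group names; it is one possible pattern under those assumptions, not necessarily the one the commenters had in mind.

```python
import re

# Simplified example from the question (header without the optional time part).
example = ("12345 2018-01-03 6789 "
           "STARTlorem ipsum dolor sit amet**STARTconsectetur adipiscing elit**")

# Raw strings keep the backslashes intact. `.*?` is the lazy form of `.*`:
# it expands only as far as needed for the following `\*{2}` to match,
# so each content part stops at the first '**' it meets.
pattern = re.compile(
    r"(?P<id1>\d+) (?P<adate>\d{4}-\d{2}-\d{2}) (?P<id2>\d+) "
    r"(?P<firstpart>START.*?\*{2})"
    r"(?P<secondpart>START.*?\*{2})"
)

m = pattern.search(example)

print(m.group('id1'))         # 12345
print(m.group('adate'))       # 2018-01-03
print(m.group('id2'))         # 6789
print(m.group('firstpart'))   # STARTlorem ipsum dolor sit amet**
print(m.group('secondpart'))  # STARTconsectetur adipiscing elit**
```

Note that the two-part attempt in the question leaves the `secondpart` group without its closing parenthesis, so `re.search` raises `re.error` on that pattern instead of simply returning no match. If the real text can span several lines, passing `re.DOTALL` may also be needed so that `.` matches newlines; that is an assumption about the data, not something the question states.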