I have a set of lines, most of which follow this format:

STARTKEYWORD some text I want to extract ENDKEYWORD\n

I want to find these lines and extract information from them.

Note that the text between the keywords can contain a wide range of characters (Latin and non-Latin letters, digits, spaces, special characters), but never \n.

ENDKEYWORD is optional and is sometimes omitted.

My attempts revolve around this regex:

STARTKEYWORD (.+)(?:\n| ENDKEYWORD)

However, the capturing group (.+) consumes as many characters as possible and also captures ENDKEYWORD, which I do not need.
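
To make the problem concrete, here is a minimal sketch in Python (the choice of Python's `re` module and the sample line are assumptions for illustration, not part of my actual data):

```python
import re

# Hypothetical sample line in the format described above.
line = "STARTKEYWORD some text I want to extract ENDKEYWORD\n"

# The regex from my attempts: (.+) is greedy, so it runs to the end of
# the line and the trailing alternative is satisfied by "\n".
greedy = re.compile(r"STARTKEYWORD (.+)(?:\n| ENDKEYWORD)")

captured = greedy.search(line).group(1)
print(repr(captured))  # 'some text I want to extract ENDKEYWORD'
```

The captured group still ends with ENDKEYWORD, which is exactly what I want to avoid.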

Is there a way to get only "some text I want to extract" using regular expressions alone?

Konstantin

2 Answers

(.+) is greedy by default and consumes everything it can. You can make it non-greedy by adding ?, and use the $ anchor instead of \n, which is simpler and also matches a final line that has no trailing newline:

STARTKEYWORD (.+?)(?:$| ENDKEYWORD$)

If you specifically want \n you can use:

STARTKEYWORD (.+?)(?:\n| ENDKEYWORD\n)

See DEMO
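
As a rough sketch of how the first pattern might be applied, assuming Python's `re` module and made-up sample text (neither comes from the question), `re.MULTILINE` is needed so that $ matches at the end of every line rather than only at the end of the input:

```python
import re

# Non-greedy capture; with re.MULTILINE, $ matches at each line end.
pattern = re.compile(r"STARTKEYWORD (.+?)(?:$| ENDKEYWORD$)", re.MULTILINE)

# Hypothetical input: with and without ENDKEYWORD, one unrelated line,
# and a last line that has no trailing newline.
text = (
    "STARTKEYWORD first value ENDKEYWORD\n"
    "STARTKEYWORD second value\n"
    "unrelated line\n"
    "STARTKEYWORD last value ENDKEYWORD"
)

# findall returns the contents of the single capturing group per match.
extracted = pattern.findall(text)
print(extracted)  # ['first value', 'second value', 'last value']
```

Because ENDKEYWORD is anchored with $ here, an ENDKEYWORD that appears in the middle of a line stays part of the captured text.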

karthik manchala

You could use a lookahead-based regex. It is always better to use the $ end-of-line anchor, since the last line may not end with a newline character.

STARTKEYWORD (.+?)(?= ENDKEYWORD|$)

OR

STARTKEYWORD (.+?)(?: ENDKEYWORD|$)

DEMO
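
The two variants capture the same text in group 1 and differ only in what the overall match consumes. A small sketch, assuming Python's `re` module and an invented sample line:

```python
import re

# Lookahead variant: ENDKEYWORD is checked but not consumed.
lookahead = re.compile(r"STARTKEYWORD (.+?)(?= ENDKEYWORD|$)", re.MULTILINE)
# Non-capturing group variant: ENDKEYWORD becomes part of the match.
consuming = re.compile(r"STARTKEYWORD (.+?)(?: ENDKEYWORD|$)", re.MULTILINE)

# Hypothetical sample lines, with and without the closing keyword.
with_end = "STARTKEYWORD some text ENDKEYWORD"
without_end = "STARTKEYWORD other text"

m_look = lookahead.search(with_end)
m_cons = consuming.search(with_end)

print(m_look.group(1))  # some text
print(m_cons.group(1))  # some text
print(m_look.group(0))  # STARTKEYWORD some text
print(m_cons.group(0))  # STARTKEYWORD some text ENDKEYWORD
print(lookahead.search(without_end).group(1))  # other text
```

If only group 1 is needed, either form works; the lookahead matters when the full match or the position after it is used.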

Avinash Raj