2

I'm trying to extract a string in the middle of a line with or without a particular word on the end. For example, this line:

START - some words and not THIS 

should return "some words and not" and likewise, the line:

START - some words and not

should also return the same string. I've tried using lookahead from examples I've found with alternation for EOL, but adding the alternation returns a string ending with THIS. Here is the python regex:

[^-]*- (.+(?= THIS|$))

Removing |$ works, except when the line ends without THIS. The data I'm parsing has a small number of entries missing "THIS", so I need to account for both. What's the correct pattern for this?

Jan
stickybun

3 Answers

1

You may use a lazy quantifier (.+?) as in

[^-]*- (.+?)(?:THIS|$)

See a demo on regex101.com.
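
As a rough sketch of the lazy-quantifier idea in Python, the snippet below compares the greedy pattern from the question with a lazy one. The helper name extract_middle and the anchored tail (?:\s+THIS)?\s*$ are assumptions added for illustration, not part of the answer above; the tail is there so that the space before THIS and any trailing spaces stay out of the captured group.

```python
import re

# Greedy pattern from the question: .+ runs to the end of the line first,
# where the $ alternative already succeeds, so THIS stays in the capture.
GREEDY = re.compile(r"[^-]*- (.+(?= THIS|$))")

# Lazy variant (illustrative): .+? grows one character at a time and stops
# as soon as the rest of the line is an optional " THIS" plus trailing spaces.
LAZY = re.compile(r"[^-]*- (.+?)(?:\s+THIS)?\s*$")


def extract_middle(line):
    """Return the text after the first '- ', minus a trailing THIS (hypothetical helper)."""
    match = LAZY.match(line)
    return match.group(1) if match else None


for line in ("START - some words and not THIS", "START - some words and not"):
    print(repr(GREEDY.match(line).group(1)), "->", repr(extract_middle(line)))
```

Because the tail is anchored with $, a THIS that appears in the middle of the line would stay inside the captured text; whether that is desirable depends on the data.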

Jan
0

Please take a look at this.

Based on your example, the regex (?<=START - )(.*)(?=THIS) will capture "some words and not". Hope it helps!
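
One way the lookaround approach might be extended to lines that do not end in THIS is sketched below. The name LOOKAROUND and the widened lookahead are assumptions made for illustration; the lazy .*? and the optional THIS inside the lookahead are changes to the pattern in this answer, not something it contains.

```python
import re

# The lookbehind pins the match start right after "START - ". The lookahead
# requires that only optional whitespace, an optional THIS, and the end of
# the line remain, so the lazy .*? stops before any of them.
LOOKAROUND = re.compile(r"(?<=START - ).*?(?=\s*(?:THIS\s*)?$)")

for line in ("START - some words and not THIS", "START - some words and not"):
    found = LOOKAROUND.search(line)
    print(repr(found.group(0)) if found else None)
```

Note that re.search is needed here rather than re.match, since the match itself begins after the START prefix.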

  • In testing (using regex101.com), it doesn't handle the case where THIS is not found at the end of the line, i.e. it ends in "not" – stickybun Feb 06 '20 at 18:30
  • @stickybun I just added an update that does exactly that. Let me know if it is not what you need. – Lord Elrond Feb 06 '20 at 18:32
0

If I understand correctly, this should do the trick:

>>> import re
>>> regex = re.compile(r"(?!THIS)([^-]*- .+)(THIS)?$")
>>> s1 = 'START - some words and not THIS'
>>> regex.match(s1).groups()
('START - some words and not ', 'THIS')
>>> s2 = 'START - some words and not '
>>> regex.match(s2).groups()
('START - some words and not ', None)
Lord Elrond
  • Not working for me. I've been using regex101.com to test, but going right to python I get: >>> regex = re.compile(r"(?!THIS)([^-]*- .+)(THIS)?") >>> s1 = 'START - some words and not THIS' >>> s2 = 'START - some words and not' >>> regex.match(s1).groups()[0] 'START - some words and not THIS' – stickybun Feb 06 '20 at 18:43
  • @stickybun sorry, I forgot the `$` at the end. It should work now. – Lord Elrond Feb 06 '20 at 18:48
  • Nada. Same result, at regex101 and in python. – stickybun Feb 06 '20 at 18:52
  • @stickybun What result is that? I just ran the above in terminal so I don't see how our results can differ. – Lord Elrond Feb 06 '20 at 18:54
  • Not in python 3.7. I can paste your code into a python session verbatim, and the groups for the regex.match(s1) produce: ('START - some words and not THIS', None) – stickybun Feb 06 '20 at 20:16
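
The result reported in the last comment is consistent with greedy matching: .+ consumes the rest of the line, and the optional (THIS)? then matches nothing. A possible adjustment, offered as a sketch rather than as the answerer's intended fix, is to make that quantifier lazy and allow optional whitespace before THIS. The names POSTED and FIXED are illustrative only.

```python
import re

# Pattern as posted: greedy .+ swallows THIS, so group 2 is left empty.
POSTED = re.compile(r"(?!THIS)([^-]*- .+)(THIS)?$")

# Illustrative adjustment: lazy .+? lets the optional (THIS) group take part,
# and \s* keeps trailing whitespace out of group 1.
FIXED = re.compile(r"([^-]*- .+?)\s*(THIS)?$")

s1 = "START - some words and not THIS"
s2 = "START - some words and not "

print(POSTED.match(s1).groups())
print(FIXED.match(s1).groups())
print(FIXED.match(s2).groups())
```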