5

Here is the example:

import re

a = "one two three four five six one three four seven two"
m = re.search("one.*four", a)

What I want is to find the substring from "one" to "four" that doesn't contain the substring "two" in between. The answer should be: m.group(0) = "one three four", m.start() = 28, m.end() = 41

Is there a way to do this with one search line?

satoru
Solaris

4 Answers

8

You can use this pattern:

one(?:(?!two).)*four

Before matching each additional character, we check that we are not at the start of "two".

Working example: http://regex101.com/r/yY2gG8
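
As a quick check, here is a minimal Python 3 sketch of this pattern run against the string from the question. The names `PATTERN` and `b` are only illustrative, and `b` is an extra input chosen to show that a later "one" is still found after an earlier attempt is stopped by "two".

```python
import re

# Pattern from the answer: before consuming each character, the
# negative lookahead checks that "two" does not start at that position.
PATTERN = re.compile(r"one(?:(?!two).)*four")

a = "one two three four five six one three four seven two"
m = PATTERN.search(a)
print(m.group(0), m.span())  # one three four (28, 42)

# The attempt starting at the first "one" runs into "two" and fails,
# so the engine retries from the second "one".
b = "one two one five four"
print(PATTERN.search(b).group(0))  # one five four
```

Note that `.` does not match a newline unless `re.DOTALL` is passed, so as written this only finds matches within a single line.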

Kobi
  • So we can use `(?:(?!two).)*` like it's a multi-character version of `^`, right? – satoru Nov 03 '13 at 06:13
  • If I understand it correctly, this regexp reads "Between `one` and `four`, there should be zero or more characters, none of which begins a `two`." – satoru Nov 03 '13 at 06:15
  • @Satoru.Logic - That's right. Another option is `(?:[^t]|t[^w]|tw[^o])*`, which is compatible with regex flavors without advanced features (lookahead). – Kobi Nov 03 '13 at 06:15
  • In this case, I would happily go with `lookahead` ;) – satoru Nov 03 '13 at 06:16
  • @Kobi, `(?:[^t]|t[^w]|tw[^o])*` isn't quite right, because it can consume characters that *should* be matched by what follows. For example, `one(?:[^t]|t[^w]|tw[^o])*four` doesn't match `onetwfour` - the `f` in the string is consumed by `[^o]`. – Tim Peters Nov 03 '13 at 06:36
  • @TimPeters - Excellent point. I can't think of a good solution except also disallowing `four`, which would be too messy. Let's stay with the lookahead, as Satoru suggested... – Kobi Nov 03 '13 at 06:40
  • Agreed - your solution in the answer is by far the clearest of all :-) – Tim Peters Nov 03 '13 at 06:46
  • @Kobi: related [Regex: Matching by exclusion, without look-ahead - is it possible?](http://stackoverflow.com/q/466053/4279) – jfs Nov 03 '13 at 07:18
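
Tim Peters's point about `onetwfour` can be checked directly. This is a minimal sketch, assuming Python 3; the variable names are illustrative only.

```python
import re

# Lookahead-free alternative suggested in the comments.
no_lookahead = re.compile(r"one(?:[^t]|t[^w]|tw[^o])*four")
# Lookahead-based pattern from the answer.
with_lookahead = re.compile(r"one(?:(?!two).)*four")

s = "onetwfour"  # contains no "two", so a match is expected

# "tw[^o]" consumes the "f" of "four", and no other way through the
# alternation gets past the "t", so the overall match fails.
print(no_lookahead.search(s))             # None
print(with_lookahead.search(s).group(0))  # onetwfour
```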
2

You can use the negative lookahead assertion (?!...):

re.findall("one(?!.*two).*four", a)
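
A minimal sketch of how this lookahead behaves, assuming Python 3. The `(?!.*two)` check scans to the end of the line rather than stopping at "four", so a "two" anywhere after a candidate "one" rejects it. The `short` string is a hypothetical variant of the input without the trailing words, used only for contrast.

```python
import re

pattern = r"one(?!.*two).*four"

# Hypothetical input with no "two" after the wanted match.
short = "one two three four five six one three four"
# Input from the question, which ends in " seven two".
full = "one two three four five six one three four seven two"

print(re.findall(pattern, short))  # ['one three four']
print(re.findall(pattern, full))   # []
```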
satoru
  • It works for his specific string, but not if you append " two" to his specific string - the lookahead applies to the *entire* remainder of the string, not just up until it finds the rightmost "four". – Tim Peters Nov 03 '13 at 05:34
  • That's awesome! It's awkward looking for some of these things. I didn't even know how to properly search for an answer. Thanks! – Solaris Nov 03 '13 at 05:44
  • @user2948379, note that Satoru edited your question to make it harder (added " two" to the end of your string), and the answer now doesn't find any matches (for the reason I explained in my comment above). This is still harder than it looks ;-) – Tim Peters Nov 03 '13 at 05:46
  • @TimPeters I'm wondering if there is a simple way to check if a pattern is solvable using Regex ;p – satoru Nov 03 '13 at 05:53
  • @user2948379 Is it possible for a trailing `two` to occur in your inputs? If this is not the case, feel free to remove the `two` I added to your example and sorry about that. – satoru Nov 03 '13 at 05:56
1

With the harder string Satoru added, this works:

>>> import re
>>> a = "one two three four five six one three four seven two"
>>> re.findall("one(?!.*two.*four).*four", a)
['one three four']

But - someday - you're really going to regret writing tricky regexps. If this were a problem I needed to solve, I'd do it like this:

for m in re.finditer("one.*?four", a):
    if "two" not in m.group():
        break
else:
    m = None  # no candidate without "two" was found

It's tricky enough that I'm using a minimal match there (.*?). Regexps can be a real pain :-(
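
The same filter can be wrapped in a small helper. This is only a sketch, assuming Python 3: the function name and its parameters are made up for illustration, and it returns None when nothing qualifies.

```python
import re

def first_span_without(text, start="one", end="four", banned="two"):
    """Return the first minimal start...end match lacking `banned`, or None."""
    pattern = re.escape(start) + ".*?" + re.escape(end)
    for m in re.finditer(pattern, text):
        if banned not in m.group():
            return m
    return None

a = "one two three four five six one three four seven two"
m = first_span_without(a)
print(m.group(), m.span())  # one three four (28, 42)
```

Because `re.finditer` yields non-overlapping matches, a "one" that sits inside a rejected candidate is never tried as a starting point, so this approach can miss valid spans on some inputs.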

EDIT: LOL! But the messier regexp at the top fails yet again if you make the string harder still:

a = "one two three four five six one three four seven two four"

FINALLY: here's a correct solution:

>>> a = 'one two three four five six one three four seven two four'
>>> m = re.search("one([^t]|t(?!wo))*four", a)
>>> m.group()
'one three four'
>>> m.span()
(28, 42)

I know you said you wanted m.end() to be 41, but that was incorrect.

Tim Peters
  • The second version - `one.*?four` with a filter - will fail for `"one two one five four"`. There *must* be an elegant solution that just captures `one` `four` and `two`, and takes the right pairs. Python should be a good language for such solutions, but I don't know it too well... – Kobi Nov 03 '13 at 06:13
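
One way to read Kobi's suggestion is a single pass over the three keywords. The sketch below is an interpretation rather than anything posted in the thread: `find_clean_span` is a hypothetical name, Python 3 is assumed, and it stops at the nearest "four", whereas the greedy regexes above extend to the last "four" before the next "two".

```python
import re

def find_clean_span(text):
    """Return (start, end) of the leftmost "one"..."four" span with no
    "two" starting in between, ending at the nearest "four"; else None."""
    start = None  # earliest "one" not yet cancelled by a later "two"
    # The lookahead reports a keyword at every position where one begins,
    # including overlapping ones such as the "one" inside "twone".
    for tok in re.finditer(r"(?=(one|two|four))", text):
        word, pos = tok.group(1), tok.start()
        if word == "one" and start is None:
            start = pos
        elif word == "two":
            start = None
        elif word == "four" and start is not None:
            return start, pos + len(word)
    return None

s = "one two one five four"
span = find_clean_span(s)
print(span, s[span[0]:span[1]])  # (8, 21) one five four
```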
0

Another one-liner with a very simple pattern:

import re
line = "one two three four five six one three four seven two"

# Python 3: print is a function, so the expression needs parentheses.
print([X for X in [a.split()[1:-1] for a in
                   re.findall('one.*?four', line, re.DOTALL)] if 'two' not in X])

This gives me:

[['three']]
kiriloff