
I need to find all the strings matching a pattern with the exception of two given strings.

For example, find all groups of letters with the exception of aa and bb. Starting from this string:

-a-bc-aa-def-bb-ghij-

Should return:

('a', 'bc', 'def', 'ghij')

I tried with this regular expression that captures 4 strings. I thought I was getting close, but (1) it doesn't work in Python and (2) I can't figure out how to exclude a few strings from the search. (Yes, I could remove them later, but my real regular expression does everything in one shot and I would like to include this last step in it.)

I said it doesn't work in Python because I tried this, expecting the exact same result, but instead I get only the first group:

>>> import re
>>> re.search('-(\w.*?)(?=-)', '-a-bc-def-ghij-').groups()
('a',)

I tried with negative look ahead, but I couldn't find a working solution for this case.

stenci
  • You want [`findall`](https://docs.python.org/2/library/re.html#re.findall) - `search` is only supposed to return the first match :) – cxw Sep 20 '16 at 17:22
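To illustrate the comment above, here is a minimal sketch that runs the pattern from the question through both functions. The variable names are only illustrative.

```python
import re

s = '-a-bc-def-ghij-'
pattern = r'-(\w.*?)(?=-)'

# re.search stops at the first match, so .groups() holds only its groups
first = re.search(pattern, s).groups()

# re.findall scans the whole string and returns group 1 of every match
all_matches = re.findall(pattern, s)

print(first)        # ('a',)
print(all_matches)  # ['a', 'bc', 'def', 'ghij']
```

So the pattern itself was already fine for the simple case; only the function used to apply it needed to change.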

3 Answers


You can make use of negative look aheads.

For example,

>>> import re
>>> string = '-a-bc-aa-def-bb-ghij-'
>>> re.findall(r'-(?!aa|bb)([^-]+)', string)
['a', 'bc', 'def', 'ghij']

  • - Matches -

  • (?!aa|bb) Negative lookahead, checks if - is not followed by aa or bb

  • ([^-]+) Matches one or more characters other than -


Edit

The above regex also skips groups that merely start with aa or bb, such as -aabc-. To exclude only the exact strings aa and bb, we can add - to the alternatives in the lookahead:

>>> re.findall(r'-(?!aa-|bb-)([^-]+)', string)
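A small sketch of the difference between the two patterns. The test string below is made up for illustration and is not from the question.

```python
import re

# Hypothetical input with groups that merely start with aa or bb
s = '-a-aabc-aa-bb-bbq-'

# Rejects anything that starts with aa or bb
loose = re.findall(r'-(?!aa|bb)([^-]+)', s)

# Rejects only the exact groups aa and bb (each followed by a hyphen)
strict = re.findall(r'-(?!aa-|bb-)([^-]+)', s)

print(loose)   # ['a']
print(strict)  # ['a', 'aabc', 'bbq']
```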
nu11p01n73R
  • Just FYI: the `(?!aa|bb)` lookahead disallows those matches that *start* with `aa` or `bb`. So, say, `aacn` [will not be matched](https://regex101.com/r/jR5sH1/1). – Wiktor Stribiżew Sep 20 '16 at 17:55
  • @WiktorStribiżew Valid point. I have added an edit to the answer. Thanks for pointing out :) – nu11p01n73R Sep 20 '16 at 18:01
  • Yeah, I just think the `-` at the end is actually required - just judging by the OP input string. If there is no `-` at the end your regex will return a match, mine won't. This point is not clear, but I got a downvote for it, I guess. – Wiktor Stribiżew Sep 20 '16 at 18:04
  • @WiktorStribiżew I guess OP meant to say that `-` is required at the end. And this makes the pattern simpler compared to the one without. – nu11p01n73R Sep 20 '16 at 18:16
  • If a hyphen at the end is required, then `(?=-)` is required at the end. I kept it in my pattern. – Wiktor Stribiżew Sep 20 '16 at 18:18
  • I finally understand the negative look ahead: the `(?! ... )` checks whether something is there or not and captures nothing, while the following `(...)` is in charge of capturing whatever I need. I was trying to perform both the tasks in one step. Yay!! – stenci Sep 20 '16 at 19:05
  • @stenci, Yeah, you are right. It's more like checking than matching: the lookahead checks what follows, returns to the same position, and the match continues only if that check fails (for a negative lookahead) or succeeds (for a positive lookahead). I tried explaining this in detail in [another answer here](http://stackoverflow.com/questions/27691225/understanding-negative-lookahead/27691287#27691287). Hope it helps. :) – nu11p01n73R Sep 20 '16 at 19:10

You need to use a negative lookahead to restrict a more generic pattern, and a re.findall to find all matches.

Use

res = re.findall(r'-(?!(?:aa|bb)-)(\w+)(?=-)', s)

or, if the values between hyphens can contain any character except a hyphen, use a negated character class [^-]:

res = re.findall(r'-(?!(?:aa|bb)-)([^-]+)(?=-)', s)

Here is the regex demo.

Details:

  • - - a hyphen
  • (?!(?:aa|bb)-) - if there is aa- or bb- after the first hyphen, no match should be returned
  • (\w+) - Group 1 (this value will be returned by the re.findall call) capturing 1 or more word chars OR [^-]+ - 1 or more characters other than -
  • (?=-) - there must be a - after the word chars. A lookahead is required here because it does not consume the hyphen, which leaves it available as the starting point of the next match.
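To see why the trailing hyphen should be checked with a lookahead rather than matched, here is a short comparison. The variant with a literal trailing `-` is only written for contrast and is not part of the answer.

```python
import re

s = "-a-bc-aa-def-bb-ghij-"

# Trailing hyphen is consumed: it cannot also open the next match,
# so the group right after each match is skipped
consuming = re.findall(r'-(?!(?:aa|bb)-)([^-]+)-', s)

# Trailing hyphen is only checked: the next match can start on it
lookahead = re.findall(r'-(?!(?:aa|bb)-)([^-]+)(?=-)', s)

print(consuming)  # ['a', 'def', 'ghij']
print(lookahead)  # ['a', 'bc', 'def', 'ghij']
```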

Python demo:

import re
p = re.compile(r'-(?!(?:aa|bb)-)([^-]+)(?=-)')
s = "-a-bc-aa-def-bb-ghij-"
print(p.findall(s)) # => ['a', 'bc', 'def', 'ghij']
Wiktor Stribiżew
  • I will add this remark here, too: I think the last lookahead is required because the last match is only valid if it is followed by `-`. That is deduced from the OP string, so I am not that sure. – Wiktor Stribiżew Sep 20 '16 at 18:13

Although a regex solution was asked for, I would argue that this problem can be solved more easily with simpler Python tools, namely string splitting and filtering:

text = "-a-bc-aa-def-bb-ghij-"
exclude = {"aa", "bb"}
# [1:-1] drops the empty strings produced by the leading and trailing hyphens
result = [s for s in text.split('-')[1:-1] if s not in exclude]

This solution has the additional advantage that the comprehension can be turned into a generator expression, so the result list does not need to be constructed explicitly.
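A minimal sketch of the generator variant. The function name `iter_groups` and its `exclude` parameter are made up for illustration. Note that `split` still builds the full list of parts; only the filtered output is lazy.

```python
def iter_groups(text, exclude=frozenset({"aa", "bb"})):
    """Yield hyphen-delimited groups of text, skipping excluded ones."""
    # [1:-1] drops the empty strings from the leading and trailing hyphens
    for part in text.split('-')[1:-1]:
        if part not in exclude:
            yield part

# Items are produced one at a time; list() is used here only to show them
result = list(iter_groups("-a-bc-aa-def-bb-ghij-"))
print(result)  # ['a', 'bc', 'def', 'ghij']
```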

David Zwicker