0

I want to search AA*ZZ only if * does not contain XX.

For 2 strings:

"IY**AA**BMDHRPONWUY**ZZ**"
"BV**AA**BDMYB**XX**W**ZZ**CKU"

How can I write a regex that matches only the first one?

Tomerikoo
ghchoi
  • Show us what you've tried. – CinCout Jul 22 '19 at 04:40
  • What you can always do is split this job in two: first find all lines that match `AA(.*)ZZ` and then filter out those that contain `XX` inside. This way Regex is also cleaner, the intention behind the code is more visible to anyone reading the code, etc. etc. – Asunez Jul 22 '19 at 07:33

3 Answers

1

If you only want to match the characters A-Z, you could use:

AA(?:[A-WYZ]|X(?!X))*ZZ

Explanation

  • AA Match literally
  • (?:
    • [A-WYZ] Match A-Z except X
    • | or
    • X(?!X) Match X and assert what is directly to the right is not X
  • )* Close non capturing group and repeat 0+ times
  • ZZ Match literally
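
As a quick check of the pattern explained above, here is a minimal Python sketch run against the two strings from the question (the variable names `pattern` and `samples` are only illustrative):

```python
import re

# Pattern from the answer: AA, then uppercase letters with no "XX" pair, then ZZ.
pattern = re.compile(r"AA(?:[A-WYZ]|X(?!X))*ZZ")

samples = ["IYAABMDHRPONWUYZZ", "BVAABDMYBXXWZZCKU"]

for s in samples:
    m = pattern.search(s)
    # Print the matched part, or None when the string is rejected.
    print(s, "->", m.group(0) if m else None)
```

The first string should yield `AABMDHRPONWUYZZ`, while the second yields no match because `XX` sits between `AA` and `ZZ`.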


If other characters can also occur, another option is a negated character class, [^\sX], which matches any character except X or a whitespace character:

AA(?:[^\sX]|X(?!X))*ZZ
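
A small sketch of how this variant behaves in Python; the three test strings are made up for illustration and are not from the question:

```python
import re

# Any non-whitespace character is allowed between AA and ZZ,
# except that an X may not be directly followed by another X.
pattern = re.compile(r"AA(?:[^\sX]|X(?!X))*ZZ")

digits_ok = pattern.search("AA12-x_3ZZ")   # digits, punctuation, lowercase x
has_xx = pattern.search("AA12XX3ZZ")       # XX in between
has_space = pattern.search("AA12 3ZZ")     # whitespace in between

print(digits_ok.group(0) if digits_ok else None)
print(has_xx)
print(has_space)
```

Only the first string should match; the other two are rejected because of the `XX` pair and the whitespace respectively.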


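Another option is to use a tempered greedy token, which excludes a whole sub-string rather than a single character. The example below matches from AA to BB as long as the whole word test does not occur in between; for the strings in the question the same idea would be written as AA(?:(?!XX).)*ZZ: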

AA(?:(?!\btest\b).)*BB
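
To apply the tempered greedy token to the question's AA/ZZ/XX case, and to more than one forbidden sub-string, something like the following sketch could work. The helper name `between_without` is hypothetical and not part of any library:

```python
import re

def between_without(start, end, *forbidden):
    """Hypothetical helper: build start...end where none of the
    `forbidden` sub-strings may begin anywhere in between."""
    if not forbidden:
        raise ValueError("need at least one forbidden sub-string")
    banned = "|".join(re.escape(f) for f in forbidden)
    # (?:(?!banned).)* consumes one character at a time, checking before
    # each one that no forbidden sub-string starts at that position.
    return re.compile(
        re.escape(start) + r"(?:(?!" + banned + r").)*" + re.escape(end)
    )

pattern = between_without("AA", "ZZ", "XX")   # AA(?:(?!XX).)*ZZ

for s in ["IYAABMDHRPONWUYZZ", "BVAABDMYBXXWZZCKU"]:
    m = pattern.search(s)
    print(s, "->", m.group(0) if m else None)
```

Because the inputs go through `re.escape`, the delimiters and forbidden sub-strings are treated as literal text, not as regex syntax.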


The fourth bird
  • Can it be more general? What I want is to exclude not a single character but a sub-string. (Hopefully several sub-strings...) – ghchoi Jul 22 '19 at 07:35
  • @GyuHyeonChoi I have added another example using a [tempered greedy token](https://stackoverflow.com/questions/30900794/tempered-greedy-token-what-is-different-about-placing-the-dot-before-the-negat) – The fourth bird Jul 22 '19 at 08:03
  • Thank you so much! Exactly what I was looking for! I tried a similar approach but I did not know I should add `\b` before and after a sub-string. So `\b` is the magic here, right? – ghchoi Jul 22 '19 at 08:48
  • @GyuHyeonChoi No, there is no magic; `\b` is a word boundary that prevents the current match from being part of a larger word. – The fourth bird Jul 22 '19 at 08:50
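
To make the word-boundary point from the comments concrete, here is a small sketch comparing the pattern with and without `\b`. The sample texts are invented for illustration:

```python
import re

with_b = re.compile(r"AA(?:(?!\btest\b).)*BB")   # forbids the whole word "test"
without_b = re.compile(r"AA(?:(?!test).)*BB")    # forbids "test" anywhere

text_word = "AA a test here BB"      # "test" as a standalone word
text_part = "AA testing here BB"     # "test" only as part of "testing"

results = {
    "with_b/word": with_b.search(text_word),
    "with_b/part": with_b.search(text_part),
    "without_b/word": without_b.search(text_word),
    "without_b/part": without_b.search(text_part),
}
for name, m in results.items():
    print(name, "->", m.group(0) if m else None)
```

Only the `\b` version should accept `text_part`, since "testing" is a larger word and not the standalone word "test". For a sub-string such as XX that should be rejected wherever it appears, the boundaries are not needed.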
0

Posting my original comment to the question as an answer

Apart from the "single-regex" solutions already posted, consider this two-step approach:

  1. First, find all matches for any text between AA and ZZ, for example with this regex: AA(.+)ZZ. Store all matches in a list.
  2. Loop through the list of matches from the previous step (or use filter functions, if available) and remove the ones that contain XX. You do not even need a regex for that, as most languages, including Python, have dedicated string methods for it.

What you get in return is a clean solution, without any complicated Regexes. It's easy to read, easy to maintain, and if any new conditions are to be added they can be applied at the final result.

To support it with some code:

import re


test_str = """
IYAABMDHRPONWUYZZ
BVAABDMYBXXWZZCKU
"""

# First step: find all strings between AA and ZZ
match_results = re.findall("AA(.+)ZZ", test_str, re.I)

# Second step: filter out the ones that contain XX
final_results = [match for match in match_results if "XX" not in match]

print(final_results)

As for the part assigned to final_results, it's called a list comprehension. Since it's not part of the question, I won't explain it here.
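
One caveat with the two steps above: a greedy `.+` spans from the first AA to the last ZZ on a line, so a line holding several AA...ZZ segments is captured as one big match. The sketch below is a variant, not the answer's original code, that uses a lazy `.+?` under the assumption that segments do not nest. The function name `segments_without` and its parameters are hypothetical:

```python
import re

def segments_without(text, start="AA", end="ZZ", forbidden="XX"):
    # Step 1: lazily capture every start...end segment (shortest match first).
    pattern = re.escape(start) + r"(.+?)" + re.escape(end)
    candidates = re.findall(pattern, text)
    # Step 2: drop the segments that contain the forbidden sub-string.
    return [c for c in candidates if forbidden not in c]

print(segments_without("IYAABMDHRPONWUYZZ\nBVAABDMYBXXWZZCKU"))
print(segments_without("AAbZZ AAcXXdZZ AAeZZ"))
```

With the lazy quantifier, the second call keeps the `b` and `e` segments and drops only the one containing `XX`.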

Asunez
-1

My guess, though I am not sure, is that you want an expression similar to:

^(?!.*(?=AA.*XX.*ZZ).*).*AA.*ZZ.*$

Test

import re

regex = r"^(?!.*(?=AA.*XX.*ZZ).*).*AA.*ZZ.*$"

test_str = """
IYAABMDHRPONWUYZZ
BVAABDMYBXXWZZCKU
AABMDHRPONWUYXxXxXxZZ
"""

print(re.findall(regex, test_str, re.M))

Output

['IYAABMDHRPONWUYZZ', 'AABMDHRPONWUYXxXxXxZZ']

The expression is explained on the top right panel of regex101.com, where you can also explore, simplify, or modify it and watch how it matches against some sample inputs.

Emma