Searching for strings where substrings occur at specific positions, using negative look-ahead

Question

I am facing a problem when trying to create a regex that finds strings containing specific combinations of substrings.

For example, I am searching for the substring combination:

ab-ab-cd

1) "xxxabxxxxxxabxxxxcdxxx" -> should be a match

2) "xxxabxxxxabxxxxabxxxxcdxxxx" -> no match

3) "xxxabxxxxxxxxxxcdxxxx" -> no match

To make it even more complicated:

4) "xxxabxxxxxabxxxxcdxxxabxxx" -> should also be a match

My substring combinations could also be like this:

ab-cd

or

ab-ab-ab-cd

or

ab-cd-ab-cd

For all these (and more) examples I am looking for a systematic way to build the corresponding regexes, so that a string only matches if the substrings occur in the right order and with the correct frequency.

I got something like this for the "ab-ab-cd" substring search, but it fails in cases like 4) of my examples:

p = re.compile("(?:(?!ab).)*ab.*?ab(?!.*ab).*cd",re.IGNORECASE)

This one works in cases like 4), but it also matches strings like 2):

p = re.compile("(?:(?!ab).)*ab(?:(?!ab).)*ab((?!ab|cd)*).*cd", re.IGNORECASE)

Could you please point me to my mistake?

Thanks a lot!

EDIT:

Sorry to all that my question was not clear enough. I tried to break my problem down into a simpler case, which might not have been a good idea. Here comes the detailed explanation of the problem:

I have a list of (protein) sequences and want to assign a specific type to each sequence on the basis of sequence patterns.

Therefore I created a dictionary with type-name as key and feature template (list of sequence features in a specific order) as value, e.g.:

type_a -> [A,A,B,C]

type_b -> [A,B,C]

type_c -> [A,B,A,B]

In another dict I have (simple) regex patterns for each feature, e.g.:

A -> [PHT]AG[QP]LI

B -> RS[TP]EV

C -> ...

D -> ...

Now, for each template (type_a, type_b, ...), I want to systematically build the concatenated regex pattern (i.e. for type_a, build a regex searching for A, A, B, C). That would then result in another dict with types as keys and the complete regexes as values.

Then I want to go through each sequence in my list of sequences and test all complete regex templates against each sequence. In the best case, only one complete regex (type) should match the sequence.

Taking the example from above, having the following regex-templates:

cd

ab-cd

ab-ab-cd

ab-ab-ab-cd

ab-cd-ab-cd

ab-ab-cd-ab

"xxxabxxxxxxabxxxxcdxxx"

-> this sequence should match the regex for the template "ab-ab-cd" and none of the others

With the following regex I could perfectly look for ab-ab-cd.

p = re.compile("(?:(?!ab).)*ab.*?ab(?!.*ab).*cd",re.IGNORECASE)

If my tests are correct, it only matches sequence 1) from above and not 2) or 3).

However, if I wanted to search for ab-ab-cd-ab, the negative look-ahead would not allow the last ab to be found. I came up with something like the following code to stop the negative look-ahead after the second "ab" part. In my understanding, the negative look-ahead should stop at the "cd", so that the last "ab" can be matched again.

p = re.compile("(?:(?!ab).)*ab(?:(?!ab).)*ab((?!ab|cd)*).*cd", re.IGNORECASE)

It solves the problem with the last "ab" from ab-ab-cd-ab. But somehow it now matches not only two "ab" before the "cd" (sequence 1, ab-ab-cd) but also three (or more) "ab" before the "cd" (sequence 2, ab-ab-ab-cd), which it should not.

I hope my problem is clearer now. Thanks a lot for all the answers; I will try the code tomorrow when I am back at work. Any further answers are highly appreciated, as are explanations of the regex code (I am pretty new to regex) and suggestions on which re functions (match, findall, ...) to use.

Thanks

Should the combination ab-ab-cd match strings like `ab_cd_ab_cd` ? — Aran-Fey, Jul 06 '17 at 16:25
Your question doesn't make sense if you don't say which re method you want to use: `re.search`, `re.match`, `re.findall`...? — Casimir et Hippolyte, Jul 06 '17 at 16:29
Try [`^(?:(?!ab).)*ab(?:(?!ab).)*ab(?:(?!ab).)*cd.*$`](https://regex101.com/r/bceV4x/1). But the question does not sound that clear though. — Wiktor Stribiżew, Jul 06 '17 at 16:36
I tried to update my question to be more clear. Hope that helps to understand what I am looking for. Also recommendations which re.functions to use for my problems would be highly appreciated. I am not very experienced with regex. — Sefu, Jul 06 '17 at 20:51
Wiktor's regex seems to solve my problem. Thanks a lot. Can anyone provide an explanation of the regex (especially in comparison to my proposals, which are not working as they should)? – Sefu, Jul 07 '17 at 08:03
@WiktorStribiżew: I tried to extend the regex a bit: `p = re.compile("^(?:(?!ab).)*ab(?:(?!ab).)*ab(?:(?!ab).)*(?:(?!cd).)*cd(?:(?!cd).)*$", re.IGNORECASE)`. This should match ab-ab-cd but not ab-ab-cd-cd. However, it does match ab-ab-cd-cd-like strings... – Sefu, Jul 07 '17 at 08:22

score 2 · Answer 1 · answered Jul 06 '17 at 16:31

You could use re.findall and post-process the result. Effectively, you want to find all instances of ab or cd and check whether your pattern (['ab', 'ab', 'cd']) is at the start of the list. The following:

import re

test1 = "xxxabxxxxxxabxxxxcdxxx"
test2 = "xxxabxxxxabxxxxabxxxxcdxxxx"
test3 = "xxxabxxxxxxxxxxcdxxxx"
test4 = "xxxabxxxxxabxxxxcdxxxabxxx"

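for x in (test1, test2, test3, test4):
    # Collect every occurrence of 'ab' or 'cd', in order of appearance
    matches = re.findall(r'(ab|cd)', x)
    # The first three hits must be exactly ab, ab, cd
    print(matches[:3] == ['ab', 'ab', 'cd'])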

prints

True
False
False
True

As required.
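If exactly one template should be chosen per sequence, as described in the question's EDIT, the same idea can be extended by comparing the whole list of hits against every template. The following is only a sketch under that assumption: the names `FEATURES`, `TEMPLATES`, `TOKEN_RE` and `classify` are made up for illustration, and the toy features ab and cd stand in for the real feature regexes. Note that with a whole-list comparison, example 4) is classified as ab-ab-cd-ab rather than ab-ab-cd.

```python
import re

# Toy features from the question; real feature regexes could be substituted
# (assumption: feature names are valid Python identifiers).
FEATURES = {"ab": r"ab", "cd": r"cd"}

# Templates listed in the question's EDIT, keyed by an illustrative label.
TEMPLATES = {
    "cd": ["cd"],
    "ab-cd": ["ab", "cd"],
    "ab-ab-cd": ["ab", "ab", "cd"],
    "ab-ab-ab-cd": ["ab", "ab", "ab", "cd"],
    "ab-cd-ab-cd": ["ab", "cd", "ab", "cd"],
    "ab-ab-cd-ab": ["ab", "ab", "cd", "ab"],
}

# One named group per feature, so every hit reports which feature it was.
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in FEATURES.items()),
    re.IGNORECASE,
)

def classify(sequence):
    """Return the labels of all templates whose feature list equals the
    ordered list of features found in the sequence."""
    found = [m.lastgroup for m in TOKEN_RE.finditer(sequence)]
    return [label for label, wanted in TEMPLATES.items() if wanted == found]

print(classify("xxxabxxxxxxabxxxxcdxxx"))      # ['ab-ab-cd']
print(classify("xxxabxxxxxabxxxxcdxxxabxxx"))  # ['ab-ab-cd-ab']
```

Using named groups and `lastgroup` keeps the comparison on feature names rather than on the matched text, which matters once a feature is a character-class pattern instead of a literal.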

asongtoruin


Thanks for the proposal. What does the "r" in front of '(ab|cd)' stand for? – Sefu Jul 07 '17 at 15:33
@Sefu it's a raw string literal, see [this question](https://stackoverflow.com/questions/2081640/what-exactly-do-u-and-r-string-flags-do-in-python-and-what-are-raw-string-l) – asongtoruin Jul 10 '17 at 08:44

Uri Y · Accepted Answer · 2017-07-06T16:50:43.663

0

Why do you need the negative look-ahead? Why not use something as simple as this:

.*ab.*ab.*cd

Or if you need it to find a match from the beginning of the line, you can use:

^.*ab.*ab.*cd

Edit: After your comment I understood what you need. Try this one:

^(?:(?!ab).)*ab(?:(?!ab).)*ab(?:(?!ab).)*cd
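A pattern of this shape could also be generated from a template instead of being written by hand. The sketch below is one possible way to do that, not a definitive implementation: `build_regex` and its parameters are hypothetical names, and it assumes that every gap should forbid all known features (ab and cd here), not only ab. With `exact=True` nothing but feature-free text may follow the last feature; with `exact=False` the rest of the string is ignored.

```python
import re

def build_regex(template, features, exact=True):
    """Compile a regex that matches the template's features in order,
    separated by gaps that contain no feature at all."""
    # A gap consumes one character at a time, as long as no feature
    # starts at the current position.
    forbidden = "|".join(f"(?:{pattern})" for pattern in features.values())
    gap = f"(?:(?!{forbidden}).)*"
    body = gap + "".join(f"(?:{features[name]}){gap}" for name in template)
    # exact=True anchors the end, so no extra feature may follow.
    tail = r"\Z" if exact else ""
    return re.compile("^" + body + tail, re.IGNORECASE | re.DOTALL)

features = {"ab": "ab", "cd": "cd"}  # toy features from the question
tests = [
    "xxxabxxxxxxabxxxxcdxxx",
    "xxxabxxxxabxxxxabxxxxcdxxxx",
    "xxxabxxxxxxxxxxcdxxxx",
    "xxxabxxxxxabxxxxcdxxxabxxx",
]

strict = build_regex(["ab", "ab", "cd"], features)
prefix = build_regex(["ab", "ab", "cd"], features, exact=False)
for text in tests:
    print(bool(strict.search(text)), bool(prefix.search(text)))
# True True
# False False
# False False
# False True
```

The `exact=False` variant reproduces the four expected results listed in the question, while `exact=True` treats example 4) as a different template (ab-ab-cd-ab), which is what the question's EDIT seems to ask for.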

edited Jul 06 '17 at 16:50

answered Jul 06 '17 at 16:11

Uri Y


This matches `xxxabxxxxabxxxxabxxxxcdxxxx` even though it shouldn't. – Aran-Fey Jul 06 '17 at 16:15
could you explain the regex a bit there what it's doing? I'm having trouble with the first part, does not start with ab I think? (?:(?!ab).)* – sniperd Jul 06 '17 at 17:51
Explanation of the regex: the first part, (?:(?!ab).)*, consumes characters one at a time, but only while "ab" does not start at the current position (the negative look-ahead (?!ab) is checked before each character). After that it looks for ab. This repeats twice, followed by another such run that contains no ab, and then cd. See link: https://regex101.com/r/UawK16/1 – Uri Y Jul 07 '17 at 00:05
@UriY: I tried to extend the regex a bit: p = re.compile("^(?:(?!ab).)*ab(?:(?!ab).)*ab(?:(?!ab).)*(?:(?!cd).)*cd(?:(?!cd).)*$", re.IGNORECASE). This should match ab-ab-cd but not ab-ab-cd-cd. However, it does match ab-ab-cd-cd-like strings... – Sefu Jul 07 '17 at 09:19
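One likely reason the extended pattern in the last comment still accepts ab-ab-cd-cd: the gap after the second ab only forbids ab, so it is free to swallow the first cd, and the gap that forbids cd can then match nothing at all. The snippet below is a small sketch comparing that pattern with a variant in which every gap forbids both substrings; the names `extended`, `strict` and `gap` are illustrative only.

```python
import re

flags = re.IGNORECASE

# Pattern from the comment: each gap forbids either "ab" or "cd", never both.
extended = re.compile(
    r"^(?:(?!ab).)*ab(?:(?!ab).)*ab(?:(?!ab).)*(?:(?!cd).)*cd(?:(?!cd).)*$",
    flags,
)

# Variant in which every gap forbids both substrings.
gap = r"(?:(?!ab|cd).)*"
strict = re.compile("^" + gap + "ab" + gap + "ab" + gap + "cd" + gap + "$", flags)

for text in ("xxabxxabxxcdxx", "xxabxxabxxcdxxcdxx"):
    print(text, bool(extended.search(text)), bool(strict.search(text)))
# xxabxxabxxcdxx True True
# xxabxxabxxcdxxcdxx True False
```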
