I am facing a problem when trying to create a regex that finds strings containing specific combinations of substrings.
For example, I am searching for the substring combination:
ab-ab-cd
1) "xxxabxxxxxxabxxxxcdxxx" -> should be a match
2) "xxxabxxxxabxxxxabxxxxcdxxxx" -> no match
3) "xxxabxxxxxxxxxxcdxxxx" -> no match
To make it even more complicated:
4) "xxxabxxxxxabxxxxcdxxxabxxx" -> should also be a match
My substring combinations could also be like this:
ab-cd
or
ab-ab-ab-cd
or
ab-cd-ab-cd
For all these (and more) examples I am looking for a systematic way to build the corresponding regexes, so that a string only matches if the substrings occur in the right order and with the correct frequency.
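One way the "right order and correct frequency" requirement could be checked without a single large regex is sketched below. It is only an illustration under assumptions: the helper names `feature_order` and `matches_template`, the `vocabulary` argument and the `exact` flag are made up here, and the substrings are treated as literal text. The idea is to list every known substring in the order it occurs and compare that list with the template. By default only the beginning of the list has to agree, so trailing occurrences as in example 4) are tolerated.

```python
import re


def feature_order(text, vocabulary):
    """Return the known substrings in the order they occur in text."""
    # Literal substrings are assumed, hence re.escape.
    pattern = "|".join(re.escape(s) for s in sorted(vocabulary))
    return [m.group(0).lower() for m in re.finditer(pattern, text, re.IGNORECASE)]


def matches_template(text, template, vocabulary, exact=False):
    """Compare the occurrence order in text with an ordered template list."""
    found = feature_order(text, vocabulary)
    if exact:
        # Nothing may follow the template.
        return found == template
    # Only the leading occurrences have to agree with the template.
    return found[:len(template)] == template


vocabulary = ["ab", "cd"]
template = ["ab", "ab", "cd"]
examples = [
    "xxxabxxxxxxabxxxxcdxxx",      # 1)
    "xxxabxxxxabxxxxabxxxxcdxxxx", # 2)
    "xxxabxxxxxxxxxxcdxxxx",       # 3)
    "xxxabxxxxxabxxxxcdxxxabxxx",  # 4)
]
results = [matches_template(s, template, vocabulary) for s in examples]
print(results)  # [True, False, False, True]
```

With `exact=True` example 4) would be rejected as well, because an extra "ab" follows the "cd".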
I got something like this for the "ab-ab-cd" substring search but it fails in cases like 4) of my examples.
p = re.compile("(?:(?!ab).)*ab.*?ab(?!.*ab).*cd",re.IGNORECASE)
This one works for cases like 4), but it also matches strings like 2):
p = re.compile("(?:(?!ab).)*ab(?:(?!ab).)*ab((?!ab|cd)*).*cd", re.IGNORECASE)
Could you please point me to my mistake?
Thanks a lot!
EDIT:
Sorry to all that my question was not clear enough. I tried to break my problem down into a simpler case, which might not have been a good idea. Here is the detailed explanation of the problem:
I have a list of (protein) sequences and want to assign a specific type to each sequence on the basis of sequence patterns.
Therefore I created a dictionary with type-name as key and feature template (list of sequence features in a specific order) as value, e.g.:
type_a -> [A,A,B,C]
type_b -> [A,B,C]
type_c -> [A,B,A,B]
In another dict I have (simple) regex patterns for each feature, e.g.:
A -> [PHT]AG[QP]LI
B -> RS[TP]EV
C -> ...
D -> ...
Now, for each template (type_a, type_b, ...), I want to systematically build the concatenated regex pattern (i.e. for type_a, build a regex searching for A, A, B, C). That would then result in another dict with the types as keys and the complete regexes as values.
Now I want to go through each sequence in my list of sequences and test all complete regex templates against each sequence. In the best case, only one complete regex (type) should match the sequence.
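A minimal sketch of how such a dict of complete regexes might be built and applied follows. It is not a tested solution for real protein data: the `features` and `templates` dicts use the simplified ab/cd stand-ins, and the helper name `build_template_regex` and its `allow_trailing` flag are assumptions made for illustration. Between two features the pattern uses a gap that consumes one character at a time but refuses to step onto the start of any known feature, which is what fixes the number of occurrences.

```python
import re

# Hypothetical stand-ins for the feature and template dicts described above.
features = {"ab": "ab", "cd": "cd"}
templates = {
    "ab-cd": ["ab", "cd"],
    "ab-ab-cd": ["ab", "ab", "cd"],
    "ab-ab-ab-cd": ["ab", "ab", "ab", "cd"],
}


def build_template_regex(template, features, allow_trailing=False):
    """Build one anchored regex for an ordered list of feature names."""
    # Alternation of every known feature, used only inside the look-ahead.
    any_feature = "|".join(f"(?:{pattern})" for pattern in features.values())
    # A gap never consumes a character at which some feature starts.
    gap = f"(?:(?!{any_feature}).)*"
    body = "".join(gap + f"(?:{features[name]})" for name in template)
    # Either forbid further features after the template, or ignore the rest.
    tail = ".*" if allow_trailing else gap
    return re.compile(f"^{body}{tail}$", re.IGNORECASE | re.DOTALL)


complete = {name: build_template_regex(t, features) for name, t in templates.items()}

sequence = "xxxabxxxxxxabxxxxcdxxx"
matching_types = [name for name, rx in complete.items() if rx.match(sequence)]
print(matching_types)  # ['ab-ab-cd']
```

With `allow_trailing=True` the regex for ab-ab-cd would also accept a sequence that has a further "ab" after the "cd", while three "ab" before the "cd" would still be rejected.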
Taking the example from above, having the following regex-templates:
cd
ab-cd
ab-ab-cd
ab-ab-ab-cd
ab-cd-ab-cd
ab-ab-cd-ab
"xxxabxxxxxxabxxxxcdxxx"
->this sequence should match the regex for the template "ab-ab-cd" and not any of the others
With the following regex I could perfectly look for ab-ab-cd.
p = re.compile("(?:(?!ab).)*ab.*?ab(?!.*ab).*cd",re.IGNORECASE)
If my tests are correct, it only matches sequence 1) from above and not 2) or 3).
However, if I want to search for ab-ab-cd-ab, the negative look-ahead does not allow the last ab to be found. I found something like the following code to end the negative look-ahead after the second "ab" part. In my understanding, the negative look-ahead should stop at the "cd", so that the last "ab" can be matched again.
p = re.compile("(?:(?!ab).)*ab(?:(?!ab).)*ab((?!ab|cd)*).*cd", re.IGNORECASE)
It solves the problem with the last "ab" from ab-ab-cd-ab. But somehow it now matches not only 2 times "ab" before the "cd" (sequence 1, ab-ab-cd) but also 3 (or more) times "ab" before the "cd" (sequence 2, ab-ab-ab-cd), which it should not.
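A possible explanation, sketched under the assumption that the patterns are applied with `re.match`: in `((?!ab|cd)*)` the `*` repeats the look-ahead itself, which consumes no characters, so the `.*` that follows is unrestricted and can run across a third "ab". The comparison below uses two simplified patterns written for illustration (the names `loose` and `strict` are not from the original code). The first keeps only the unrestricted `.*`, the second puts the look-ahead inside the repeated group so that every consumed character is checked.

```python
import re

s1 = "xxxabxxxxxxabxxxxcdxxx"
s2 = "xxxabxxxxabxxxxabxxxxcdxxxx"
s4 = "xxxabxxxxxabxxxxcdxxxabxxx"

# Unrestricted ".*" before "cd": it can run across a third "ab".
loose = re.compile(r"(?:(?!ab).)*ab(?:(?!ab).)*ab.*cd", re.IGNORECASE)
# Look-ahead inside the repeated group: no "ab" or "cd" may start in the gap.
strict = re.compile(r"(?:(?!ab).)*ab(?:(?!ab).)*ab(?:(?!ab|cd).)*cd", re.IGNORECASE)

for label, s in [("1", s1), ("2", s2), ("4", s4)]:
    print(label, bool(loose.match(s)), bool(strict.match(s)))
# 1 True True
# 2 True False
# 4 True True
```

Note that `re.match` only tries the start of the string. With `re.search` the strict pattern can begin after the first "ab" of sequence 2 and would then accept it again.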
I hope my problem is clearer now. Thanks a lot for all the answers; I will try the code tomorrow when I am back at work. Any further answers are highly appreciated, as are explanations of the regex code (I am pretty new to regex) and suggestions on which re functions (match, findall, ...) to use.
Thanks