Keep in mind that a regex engine will try as hard as it can to find a match (this is also one of the biggest weaknesses of regex, and it is what often causes catastrophic backtracking). This means that in your current regex:
[^A-Z]*[A-Z]{3}[^A-Z]*
[A-Z]{3} is matching 3 uppercase letters, and both [^A-Z]* are matching nothing (or empty strings). You can see how by using capture groups:
import re
theString = "ABCDE"
# One capture group per part of the regex, so each part can be inspected
pattern = re.compile(r"([^A-Z]*)([A-Z]{3})([^A-Z]*)")
result = pattern.search(theString)
if result:
    print("Matched string: {" + result.group(0) + "}")
    print("Sub match 1: {" + result.group(1) + "} 2. {" + result.group(2) + "} 3. {" + result.group(3) + "}")
else:
    print("No match")
Prints:
Matched string: {ABC}
Sub match 1: {} 2. {ABC} 3. {}
Do you see what happened? Since [^A-Z]* can also accept 'nothing', it happily matches an empty string whenever there is no non-uppercase character at that position, and the overall match still succeeds.
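To make the empty matches visible, here is a small sketch that reuses the pattern and input from the script above and prints where each group matched. A group whose start and end index are equal matched an empty string.

```python
import re

# Same pattern and input as in the script above.
pattern = re.compile(r"([^A-Z]*)([A-Z]{3})([^A-Z]*)")
result = pattern.search("ABCDE")

# span(n) gives the (start, end) indices of group n; start == end means
# the group matched an empty string.
spans = [result.span(n) for n in (1, 2, 3)]
print(spans)  # [(0, 0), (0, 3), (3, 3)]
```

Groups 1 and 3 both took part in the match, but each one is zero characters wide.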
What you probably wanted was to use something more like this:
([^A-Z]|^)[A-Z]{3}([^A-Z]|$)
It will match a string containing three consecutive uppercase letters when there are no other uppercase letters around them (the |^ means OR at the beginning and |$ means OR at the end). If you use that regex in the little script above, you will not get any match in ABCDE, which is what you wanted. If you use it on the string abcABC, you get:
import re
theString = "abcABC"
pattern = re.compile(r"([^A-Z]|^)([A-Z]{3})([^A-Z]|$)")
result = pattern.search(theString)
if result:
    print("Matched string: {" + result.group(0) + "}")
    print("Sub match 1: {" + result.group(1) + "} 2. {" + result.group(2) + "} 3. {" + result.group(3) + "}")
Prints:
Matched string: {cABC}
Sub match 1: {c} 2. {ABC} 3. {}
The [^A-Z] is actually matching (or in better regex terms, consuming) a character, and if you only care about checking whether or not the string contains exactly 3 uppercase characters in a row, that regex would suffice.
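For the "just checking" case, a minimal sketch could look like the following. The helper name has_three_caps_run and the sample strings are made up for illustration; only the pattern comes from this answer.

```python
import re

# Pattern from this answer; the helper name below is hypothetical.
THREE_CAPS = re.compile(r"([^A-Z]|^)[A-Z]{3}([^A-Z]|$)")

def has_three_caps_run(text):
    """Return True if text contains a run of exactly 3 uppercase letters."""
    return THREE_CAPS.search(text) is not None

# Sample inputs chosen for illustration.
for sample in ("ABCDE", "abcABC", "ABC", "abABcd"):
    print(sample, has_three_caps_run(sample))
```

Only abcABC and ABC should report True here: ABCDE has too many uppercase letters in a row and abABcd has too few.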
If you want to extract those uppercase characters, you can use a capture group like in the above example and use result.group(2) to get them.
Actually, if you turn the two outer capture groups into non-capturing groups:
(?:[^A-Z]|^)([A-Z]{3})(?:[^A-Z]|$)
you can use result.group(1) to get those 3 letters.
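As a quick sketch of that variant, using the same abcABC input as before: the whole match still contains the consumed neighbouring character, but the only numbered group left is the one holding the letters.

```python
import re

pattern = re.compile(r"(?:[^A-Z]|^)([A-Z]{3})(?:[^A-Z]|$)")
result = pattern.search("abcABC")
if result:
    # group(0) still includes the consumed "c"; group(1) is only the letters.
    print("Whole match: {" + result.group(0) + "}")  # {cABC}
    print("Letters: {" + result.group(1) + "}")      # {ABC}
```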
Otherwise, if you don't mind using lookarounds (they can be a little harder to understand), you won't have to use capture groups. Vasili's answer shows exactly how you use them:
(?<![A-Z])[A-Z]{3}(?![A-Z])
(?<! ... ) is a negative lookbehind and will prevent a match if the pattern inside matches the previous character(s). Here, if the previous character matches [A-Z], the match will fail.
(?! ... ) is a negative lookahead and will prevent a match if the pattern inside matches the next character(s). Here, if the next character matches [A-Z], the match will fail. Neither lookaround consumes any characters, so you can simply use .group() to get those uppercase letters:
import re
theString = "abcABC"
pattern = re.compile(r"(?<![A-Z])[A-Z]{3}(?![A-Z])")
result = pattern.search(theString)
if result:
    print("Matched string: {" + result.group() + "}")
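One practical difference between consuming the neighbouring character and merely looking at it shows up when you want every match rather than just the first. The sketch below compares the two patterns from this answer with findall; the input ABC-DEF is just a sample chosen to make the difference visible.

```python
import re

consuming = re.compile(r"(?:[^A-Z]|^)([A-Z]{3})(?:[^A-Z]|$)")
lookaround = re.compile(r"(?<![A-Z])[A-Z]{3}(?![A-Z])")

text = "ABC-DEF"  # sample input chosen for illustration

# The consuming pattern swallows the "-" as part of the first match, so the
# search for the next match resumes at "D" with no character left before it.
consuming_hits = consuming.findall(text)
# Lookarounds only inspect the neighbours, so "-" is still available.
lookaround_hits = lookaround.findall(text)

print(consuming_hits)   # ['ABC']
print(lookaround_hits)  # ['ABC', 'DEF']
```

If you only ever need the first match, both approaches behave the same on this input.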
I hope it wasn't too long :)