3

I need a regex statement that will check for three capital letters in a row.

For example, it should match: ABC, aABC, abcABC

But it should not match: AaBbBc, ABCDE

At the moment this is my statement:

'[^A-Z]*[A-Z]{3}[^A-Z]*'

But this matches ABCDE. What am I doing wrong?

Peter Mortensen
Hayley van Waas
  • Are you using `re.match` or `re.search`? – Peter DeGlopper Feb 06 '14 at 05:41
  • re.search, which is better in this situation? – Hayley van Waas Feb 06 '14 at 05:43
  • Either use `re.match`, which requires that the whole input string match the regexp, or `re.search` with `^` and `$` characters included to limit it to matching the bounds of the input string. http://stackoverflow.com/questions/180986/what-is-the-difference-between-pythons-re-search-and-re-match for a more detailed explanation. `re.search` returns a match if any substring of the input string matches the regexp, which `ABCDE` does. – Peter DeGlopper Feb 06 '14 at 05:44
  • Ah! It worked with `re.match` Brilliant! Thanks! – Hayley van Waas Feb 06 '14 at 05:46
  • Apologies for the misstatement regarding `re.match` - that will return a match object if the beginning of the string matches the regexp, even if the regexp does not consume the whole string. That explains the false positive on `ABCDE` using `re.match` without an explicit match against the end-of-string marker `$`. – Peter DeGlopper Feb 06 '14 at 06:00
  • Should your regexp match 'ABcABCde'? That has three capitals in a row, but your attempted regexp rejects it. – Peter DeGlopper Feb 06 '14 at 06:15
  • Yes it should match 'ABcABCde' – Hayley van Waas Feb 06 '14 at 06:19
  • For this you can consider not using regexps, since you need lookaheads and behinds and the regexp gets uglier. A small function that loops over the chars and returns groups of three uppercase chars is easy to implement. – bgusach Feb 06 '14 at 08:06
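A minimal sketch of the loop-based approach suggested in the last comment, assuming "capital" means the ASCII letters A-Z as in the regexes on this page. The function name runs_of_three_capitals is made up for illustration, and itertools.groupby does the looping over the characters:

```python
from itertools import groupby


def runs_of_three_capitals(text):
    """Return every run of exactly three consecutive A-Z letters in text."""
    runs = []
    # groupby splits the text into maximal runs of capitals / non-capitals.
    for is_capital, group in groupby(text, key=lambda ch: "A" <= ch <= "Z"):
        run = "".join(group)
        if is_capital and len(run) == 3:
            runs.append(run)
    return runs


print(runs_of_three_capitals("ABcABCde"))  # ['ABC']
print(runs_of_three_capitals("ABCDE"))     # []
```

A non-empty result means the string should count as a match.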

5 Answers

5

Regex

(?<![A-Z])[A-Z]{3}(?![A-Z])

Explanation

The pattern puts a negative lookbehind before, and a negative lookahead after, the middle part that matches three capitals in a row, so the match fails if a capital letter sits directly on either side.

This is a better option than a negated character class because it still matches when there are no characters to the left or right of the three capitals, for example at the very start or end of the string.


As for the Python code, this is the syntax; calling .group() on the result returns the matched text:

Using re.match:

>>> import re
>>> p = re.compile(r'(?<![A-Z])[A-Z]{3}(?![A-Z])')
>>> s = '''ABC
... aABC
... abcABCabcABCDabcABCDEDEDEDa
... ABCDE'''
>>> result = p.match(s)
>>> result.group()
'ABC'

Using re.search:

>>> import re
>>> p = re.compile(r'(?<![A-Z])[A-Z]{3}(?![A-Z])')
>>> s = 'ABcABCde'
>>> p.search(s).group()
'ABC'
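
To list every match rather than only the first one, findall and finditer from the standard re module can be used with the same pattern. This is only a sketch; the test string is made up from the examples in the question:

```python
import re

# Same lookaround pattern as above.
p = re.compile(r'(?<![A-Z])[A-Z]{3}(?![A-Z])')

s = 'ABC aABC abcABC AaBbBc ABCDE'

# findall returns the text of every non-overlapping match.
matches = p.findall(s)
print(matches)   # ['ABC', 'ABC', 'ABC']

# finditer yields match objects, which also carry the position.
starts = [m.start() for m in p.finditer(s)]
print(starts)    # [0, 5, 12]
```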
Peter Mortensen
Vasili Syrakis
  • Thank you Burhan, I have been wanting to know how to do that for a while :) – Vasili Syrakis Feb 06 '14 at 06:11
  • This with `p.search` rather than `p.match` passes my casual tests, such as `p.search('aABC')` (`match` fails there) and `p.search('ABcABCde')`. – Peter DeGlopper Feb 06 '14 at 06:27
  • I just tested and I agree, `re.search` seems to work better here. I've added it to the answer. – Vasili Syrakis Feb 06 '14 at 06:47
  • It makes sense, since `match` requires that the regexp match at the beginning of the string, and the negative lookbehind doesn't actually consume any characters. I'm sure there's a way to write it such that `match` would work, but it's already pretty complex. – Peter DeGlopper Feb 06 '14 at 06:51
  • `.match` only looks for a match at the beginning of the string, whereas `.search` scans for a match anywhere in it. If you are looking for groups of 3 uppercase letters, you had better use `search` – bgusach Feb 06 '14 at 08:09
3

In your regular expression, the [^A-Z]* at the beginning and end is saying "Look for any number of non-capital letters, including 0." And so, ABCDE will satisfy your regular expression. For example, the substring ABC can be seen as "0 non-capital letters" followed by ABC followed by "0 non-capital letters," and nothing in the pattern stops a D from coming right after it.

I think what you want to do instead is craft a regular expression that looks for:

  1. Either "a non-capital letter" or the start of my string.
  2. Followed by "exactly 3 capital letters."
  3. Followed by "a non-capital letter" or the end of my string.

It doesn't matter how many non-capital letters precede or follow your 3 capital letters, as long as there is at least 1. So, you just need to look for 1.

Try this:

(^|[^A-Z])[A-Z]{3}([^A-Z]|$)

Note that the first ^ means start of string, which is different from the meaning of ^ inside the brackets. The $ means end of string.

Tested in Ruby, here is what we have:

regexp = /(^|[^A-Z])[A-Z]{3}([^A-Z]|$)/
'ABC'.match(regexp)    # returns a match
'aABC'.match(regexp)   # returns a match
'abcABC'.match(regexp)  # returns a match
'AaBbBc'.match(regexp) # returns nil
'ABCDE'.match(regexp)  # returns nil
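
Since the question is about Python, here is a rough equivalent of the Ruby test, assuming re.search is used (as discussed in the comments on the question). The extra test string 'ABcABCde' comes from those comments:

```python
import re

pattern = re.compile(r'(^|[^A-Z])[A-Z]{3}([^A-Z]|$)')

tests = ['ABC', 'aABC', 'abcABC', 'ABcABCde', 'AaBbBc', 'ABCDE']

# search scans the whole string; a non-None result means a match was found.
results = {s: bool(pattern.search(s)) for s in tests}

for s, matched in results.items():
    print(s, '->', 'match' if matched else 'no match')
```

The first four strings should report a match and the last two should not.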
Alvin S. Lee
  • Almost exactly the same as the answer I was just about to post! I think this is better than the solution with the negative look-behind and look-ahead because it's more efficient. Zero-length assertions are powerful, but very taxing on the regex engine, so in most cases they should be avoided if there's a simple alternative without them. – Adi Inbar Feb 06 '14 at 06:06
  • The `^$` anchors are also Zero-length assertions. I don't understand the use of alternations in the above regex though, you could just use the lazy `?` quantifier. – Vasili Syrakis Feb 06 '14 at 06:07
  • Along with @VasiliSyrakis's answer and @Jerry's detailed explanation, this is one of the only ones that correctly matches `'ABcDEF'` and similar test cases, when used with `search` rather than `match`. I include my own now-deleted answer in the failures. – Peter DeGlopper Feb 06 '14 at 06:28
  • Why "`'abcABC.match`"? Shouldn't it be "`'abcABC'.match`" (single quote after "C")? – Peter Mortensen Nov 08 '21 at 05:14
  • Typo, fixed. Thanks for catching that, Peter. – Alvin S. Lee Nov 17 '21 at 07:49
2

You have to keep in mind that regexes will try as hard as they can to find a match (that is also one of the biggest weaknesses of regex, and it is what often causes catastrophic backtracking). What this implies is that in your current regex:

[^A-Z]*[A-Z]{3}[^A-Z]*

[A-Z]{3} is matching 3 uppercase letters, and both [^A-Z]* are matching nothing (or empty strings). You can see how by using capture groups:

import re
theString = "ABCDE"
pattern = re.compile(r"([^A-Z]*)([A-Z]{3})([^A-Z]*)")
result = pattern.search(theString)

if result:
    print("Matched string: {" + result.group(0) + "}")
    print("Sub match 1: {" + result.group(1) + "} 2. {" + result.group(2) + "} 3. {" + result.group(3) + "}")
else:
    print("No match")

Prints:

Matched string: {ABC}
Sub match 1: {} 2. {ABC} 3. {}

ideone demo

Do you see what happened now? Since [^A-Z]* can also accept 'nothing', that's exactly what it'll try to do and match an empty string.

What you probably wanted was to use something more like this:

([^A-Z]|^)[A-Z]{3}([^A-Z]|$)

It will match a string containing three consecutive uppercase letters when there are no other uppercase letters directly next to them (the |^ means OR at the beginning and |$ means OR at the end). If you use that regex in the little script above, you will not get any match in ABCDE, which is what you wanted. If you use it on the string abcABC, you get:

import re
theString = "abcABC"
pattern = re.compile(r"([^A-Z]|^)([A-Z]{3})([^A-Z]|$)")
result = pattern.search(theString)

if result:
    print("Matched string: {" + result.group(0) + "}")
    print("Sub match 1: {" + result.group(1) + "} 2. {" + result.group(2) + "} 3. {" + result.group(3) + "}")

Prints:

Matched string: {cABC}
Sub match 1: {c} 2. {ABC} 3. {}

The [^A-Z] is actually matching (or in better regex terms, consuming) a character, and if you only care about checking whether or not the string contains exactly 3 uppercase characters in a row, that regex would suffice.


If you want to extract those uppercase characters, you can use a capture group like in the above example and use result.group(2) to get it.

Actually, if you turn some capture groups into non-capture groups...

(?:[^A-Z]|^)([A-Z]{3})(?:[^A-Z]|$)

You can use result.group(1) to get those 3 letters.
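
A small sketch of that extraction, with a helper name (three_capitals) that is only made up for this example:

```python
import re

# Non-capture groups on both sides, so the three letters are group 1.
pattern = re.compile(r"(?:[^A-Z]|^)([A-Z]{3})(?:[^A-Z]|$)")


def three_capitals(text):
    """Return the first run of exactly three capitals, or None."""
    result = pattern.search(text)
    return result.group(1) if result else None


print(three_capitals("abcABC"))    # ABC
print(three_capitals("ABcABCde"))  # ABC
print(three_capitals("ABCDE"))     # None
```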

Otherwise, if you don't mind using lookarounds (they can be a little harder to understand), you won't have to use capture groups. Vasili's answer shows exactly how you use them:

(?<![A-Z])[A-Z]{3}(?![A-Z])

(?<! ... ) is a negative lookbehind and will prevent a match if the pattern inside matches the previous character(s). In this case, if the previous character matches [A-Z] the match will fail.

(?! ... ) is a negative lookahead and will prevent a match if the pattern inside matches the next character(s). In this case, if the next character matches [A-Z] the match will fail.

With lookarounds, you can simply use .group() to get those uppercase letters:

import re
theString = "abcABC"
pattern = re.compile(r"(?<![A-Z])[A-Z]{3}(?![A-Z])")
result = pattern.search(theString)

if result:
    print("Matched string: {" + result.group() + "}")

ideone demo
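
One practical difference between the consuming version and the lookaround version shows up when collecting all matches with findall. The sketch below uses a made-up input, "ABC DEF", where a single space separates two runs. The consuming pattern uses up that space for the first match, so the second run has nothing left to its left to match against:

```python
import re

consuming = re.compile(r"(?:[^A-Z]|^)([A-Z]{3})(?:[^A-Z]|$)")
lookaround = re.compile(r"(?<![A-Z])[A-Z]{3}(?![A-Z])")

text = "ABC DEF"

# The space is consumed by the first match, so DEF is skipped.
print(consuming.findall(text))   # ['ABC']

# Lookarounds only inspect the neighbours, so both runs are found.
print(lookaround.findall(text))  # ['ABC', 'DEF']
```

For a simple yes/no check on a string, both patterns give the same answer.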

I hope it wasn't too long :)

Jerry
0

You can use this:

    '^(?:.*[^A-Z])?[A-Z]{3}(?:[^A-Z].*)?$'

Explanation:

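  • ^ and $ to match the start and end of the line.
  • (?:.*[^A-Z])? to check that the previous character is not capital (if any).
  • [A-Z]{3} to match exactly three capital letters.
  • (?:[^A-Z].*)? to check that the next character is not capital (if any).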
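
Because this pattern is anchored at both ends, it has to describe the whole string, so it should work with re.match as well as re.search. A quick check in Python, using the test strings from the question and its comments:

```python
import re

pattern = re.compile(r'^(?:.*[^A-Z])?[A-Z]{3}(?:[^A-Z].*)?$')

tests = ['ABC', 'aABC', 'abcABC', 'ABcABCde', 'AaBbBc', 'ABCDE']

# match is enough here because the pattern itself starts with ^.
results = {s: bool(pattern.match(s)) for s in tests}

for s, matched in results.items():
    print(s, '->', 'match' if matched else 'no match')
```

The first four strings should report a match and the last two should not.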
Peter Mortensen
MIE
0

Your regex itself explains what you are doing wrong:

'[^A-Z]*[A-Z]{3}[^A-Z]*'

When ^ is used inside a character set, i.e. [], it means negation, so [^A-Z]* matches zero or more characters that are not capital letters at the start. Because zero is allowed, it does not stop the match from beginning right at a capital letter. But as per your example, I think you don't want that.

[A-Z]{3} means it will exactly match three capital letters in a row.

The [^A-Z]* at the end means the same as what I explained for the first one.

If you write only '[A-Z]{3}', it would match the first three consecutive capital letters anywhere in the string.

It would match ABCde abCDE aBCDe ABCDE but it would not match abcDE ABcDE AaBcCc

Just try it.

Example in Perl

#!/usr/bin/perl
use strict;
use warnings;

my @arr = qw(AaBsCc abCDE ABCDE AbcDE abCDE ABC aABC abcABC);

foreach my $string(@arr){
  if($string =~ m/[A-Z]{3}/){
    print "Matched $string\n";
  }
  else {
    print "Didn't match $string \n";
  }
}

Output:

Didn't match AaBsCc
Matched abCDE
Matched ABCDE
Didn't match AbcDE
Matched abCDE
Matched ABC
Matched aABC
Matched abcABC
Peter Mortensen
Jassi