4

Hi,

I'm trying to match UK postcodes, using the pattern from http://interim.cabinetoffice.gov.uk/media/291370/bs7666-v2-0-xsd-PostCodeType.htm,

/^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z-[CIKMOV]]{2}$/

I'm using this in PHP, but it doesn't match the valid postcode OL13 0EF. This postcode does match, however, when I remove the -[CIKMOV] character class subtraction.

I get the impression that I'm doing character class subtraction wrong in PHP. I'd be most grateful if anyone could correct my error.

Thanks in advance for your help.

Ross

Ross McFarlane

4 Answers

7

Most regex flavours do not support character class subtraction. Instead, you could use a negative look-ahead assertion, which rejects the match if either of the final two letters is one of C, I, K, M, O, V:

/^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9](?!.?[CIKMOV])[A-Z]{2}$/
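
As a rough check of how the look-ahead behaves, here is a minimal sketch using Python's `re` module rather than PHP's `preg_match` (an assumption made only for illustration; the look-ahead syntax is the same in both). The helper name `is_postcode` and the two invalid sample strings are hypothetical.

```python
import re

# Look-ahead variant of the pattern, compiled with Python's re module.
# Assumption: Python is used here only for illustration; the question uses PHP/PCRE.
POSTCODE = re.compile(r"^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9](?!.?[CIKMOV])[A-Z]{2}$")


def is_postcode(text: str) -> bool:
    """Return True if text matches the look-ahead postcode pattern."""
    return POSTCODE.match(text) is not None


print(is_postcode("OL13 0EF"))  # valid postcode from the question
print(is_postcode("OL13 0CF"))  # hypothetical input: C as first of the last two letters
print(is_postcode("OL13 0EV"))  # hypothetical input: V as second of the last two letters
```

The `.?` inside the look-ahead is what lets a single assertion cover both trailing positions: it tests the first letter when it matches nothing and the second letter when it consumes one character.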
SilentGhost
  • I don't really get how this is "cleaner". It is the cooler solution, no doubt, but way more cryptic than the other solutions. – fresskoma Jan 24 '11 at 16:10
  • This is not a pure character class solution and it's ambiguous. Change the {2} to a {3} a year from now, then try to debug it. –  Jan 24 '11 at 20:21
  • you don't say. tomorrow they'll change to digit-only postcodes, and you'll have to *re-write* the regex altogether! – SilentGhost Jan 24 '11 at 20:29
5

If class subtraction is not supported, you should be able to use negated classes to achieve the same effect.

Some examples are [^\D] = \d, [^[:^alpha:]] = [a-zA-Z]

Your problem could be solved this way, using a negated POSIX character class inside a negated character class, such as [^a-z[:^alpha:]CIKMOV]

[^
a-z # not a-z
[:^alpha:] # not not A-Za-z
CIKMOV # not C,I,K,M,O,V
]

Edit - This works too and might be easier to read: [^[:^alpha:][:lower:]CIKMOV]

[^
[:^alpha:] # not not A-Za-z
[:lower:] # not a-z
CIKMOV # not C,I,K,M,O,V
]

The result is a character class that is A-Z without C,I,K,M,O,V, which is basically a subtraction.

Here is a test of 2 different class concoctions (in Perl):

use strict;
use warnings;

my $match = '';

   # ANYOF[^\0-@CIKMOV[-\377!utf8::IsAlpha]
for (0 .. 255) {
   if (chr($_) =~ /^[^a-z[:^alpha:]CIKMOV]$/) {
       $match .= chr($_); next;
   }
   $match .= ' ';
}
$match =~ s/^ +//;
$match =~ s/ +$//;
print "'$match'\n";
$match = '';

   # ANYOF[^\0-@CIKMOV[-\377+utf8::IsDigit !utf8::IsWord]
for (0 .. 255) {
   if (chr($_) =~ /^[^a-z\d\W_CIKMOV]$/) {
       $match .= chr($_); next;
   }
   $match .= ' ';
}
$match =~ s/^ +//;
$match =~ s/ +$//;
print "'$match'\n";

The output shows the gaps left in A-Z by removing CIKMOV, for the tested character codes 0-255:
'AB DEFGH J L N PQRSTU WXYZ'
'AB DEFGH J L N PQRSTU WXYZ'
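
The same idea can be sketched outside Perl. The snippet below assumes Python's `re` module, which has no POSIX bracket classes, so it uses the `\W`, `\d` and `_` variant of the negated class from the second test above. The `re.ASCII` flag is an added assumption that keeps `\W` and `\d` limited to ASCII semantics.

```python
import re

# Negated class: exclude non-word characters, digits, underscore,
# lowercase letters, and the letters C, I, K, M, O, V.
# re.ASCII restricts \W and \d to their ASCII definitions.
NOT_EXCLUDED = re.compile(r"[^\W\d_a-zCIKMOV]", re.ASCII)

# Collect every character code 0-255 that the class accepts.
kept = "".join(chr(c) for c in range(256) if NOT_EXCLUDED.fullmatch(chr(c)))
print(kept)
```

Everything that is not an uppercase ASCII letter is caught by one of the excluded groups, so only A-Z minus CIKMOV should remain.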

  • this would require ascii input. – SilentGhost Jan 24 '11 at 20:28
  • @SilentGhost Internally in perl everything is a byte string, encode to go out, decode to come in. codepoints are as usual otherwise no regex. –  Jan 24 '11 at 21:49
  • @Silent, Yes, if it's not in the range of predetermined classes and there is no subtraction class, then another alternative is needed. This just happens to be in that range. –  Jan 24 '11 at 22:02
  • This is awesome. It achieves the same as character class subtraction. – Quinn Comendant Aug 10 '15 at 02:37
4

PCRE does not support char class subtraction.

So you can enumerate all the uppercase letters except CIKMOV:

^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABDEFGHJLNPQRSTUWXYZ]{2}$

which can be shortened using ranges (note that the first range must stop at H, since D-J would include I):

^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}$
codaddict
1

I think you're going to have to replace [A-Z-[CIKMOV]] with [ABD-HJLNP-UW-Z]. I don't think PHP supports character class subtraction. My alternative reads something like "A, B, D to H, J, L, N, P to U, and W to Z".
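
Hand-written ranges like this are easy to get subtly wrong, so one way to double-check them is to expand the class and compare it with A-Z minus CIKMOV. This is a small sketch in Python (an assumption; any language with a regex engine would do), and the names `wanted` and `matched` are illustrative only.

```python
import re
import string

# The set we want: all uppercase ASCII letters except C, I, K, M, O, V.
wanted = set(string.ascii_uppercase) - set("CIKMOV")

# The range-based class suggested above.
RANGE_CLASS = re.compile(r"[ABD-HJLNP-UW-Z]")
matched = {ch for ch in string.ascii_uppercase if RANGE_CLASS.fullmatch(ch)}

print(matched == wanted)

# A range that overshoots, such as D-J instead of D-H, lets I back in.
too_wide = re.fullmatch(r"[ABD-JLNP-UW-Z]", "I") is not None
print(too_wide)
```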

cambraca