Regex Character Class Subtraction with PHP

Question

Hi,

I'm trying to match UK postcodes, using the pattern from http://interim.cabinetoffice.gov.uk/media/291370/bs7666-v2-0-xsd-PostCodeType.htm,

/^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z-[CIKMOV]]{2}$/

I'm using this in PHP, but it doesn't match the valid postcode OL13 0EF. This postcode does match, however, when I remove the -[CIKMOV] character class subtraction.

I get the impression that I'm doing character class subtraction wrong in PHP. I'd be most grateful if anyone could correct my error.

Thanks in advance for your help.

Ross

SilentGhost · Accepted Answer · 2011-01-24T16:14:23.567

7

Most regex flavours do not support character class subtraction. Instead, you could use a negative look-ahead assertion:

/^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9](?!.?[CIKMOV])[A-Z]{2}$/
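A minimal PHP sketch of how the look-ahead version could be exercised with preg_match. The helper name isPostcode and the two rejected sample strings are made up for illustration; only OL13 0EF comes from the question.

```php
<?php
// Look-ahead variant from the answer above: the assertion rejects the match
// if either of the next two characters is one of C, I, K, M, O, V.
$pattern = '/^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9](?!.?[CIKMOV])[A-Z]{2}$/';

// Hypothetical helper, named only for this illustration.
function isPostcode(string $pattern, string $candidate): bool
{
    return preg_match($pattern, $candidate) === 1;
}

var_dump(isPostcode($pattern, 'OL13 0EF')); // bool(true)
var_dump(isPostcode($pattern, 'OL13 0CF')); // bool(false): C in the first letter slot
var_dump(isPostcode($pattern, 'OL13 0EC')); // bool(false): C in the second letter slot
```

The `.?` is what lets a single assertion cover both letter positions, so it is tied to the `{2}` that follows: if the quantifier ever changes, the look-ahead has to change with it.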

edited Jan 24 '11 at 16:14

answered Jan 24 '11 at 16:04

SilentGhost


I don't really get how this is "cleaner". It is the cooler solution, no doubt, but way more cryptic than the other solutions. – fresskoma Jan 24 '11 at 16:10
This is not a pure character class solution and it's ambiguous. Change the {2} to a {3} a year from now, then try to debug it. – Jan 24 '11 at 20:21
you don't say. tomorrow they'll change to digit-only postcodes, and you'll have to *re-write* the regex altogether! – SilentGhost Jan 24 '11 at 20:29

score 5 · Answer 2 · 2011-01-24T20:34:01.087

If class subtraction is not supported, you should be able to use negative classes to achieve subtractions.

Some examples are [^\D] = \d, [^[:^alpha:]] = [a-zA-Z]

Your problem could be solved that way, by putting a negated POSIX character class inside a negated character class, like [^a-z[:^alpha:]CIKMOV]

[^
a-z # not a-z
[:^alpha:] # not not A-Za-z
CIKMOV # not C,I,K,M,O,V
]

Edit - This works too and might be easier to read: [^[:^alpha:][:lower:]CIKMOV]

[^
[:^alpha:] # not not A-Za-z (only letters remain)
[:lower:] # not a-z
CIKMOV # not C,I,K,M,O,V
]

The result is a character class that matches A-Z without C, I, K, M, O, V: in effect, a subtraction.

Here is a test of 2 different class concoctions (in Perl):

use strict;
use warnings;

my $match = '';

   # ANYOF[^\0-@CIKMOV[-\377!utf8::IsAlpha]
for (0 .. 255) {
   if (chr($_) =~ /^[^a-z[:^alpha:]CIKMOV]$/) {
       $match .= chr($_); next;
   }
   $match .= ' ';
}
$match =~ s/^ +//;
$match =~ s/ +$//;
print "'$match'\n";
$match = '';

   # ANYOF[^\0-@CIKMOV[-\377+utf8::IsDigit !utf8::IsWord]
for (0 .. 255) {
   if (chr($_) =~ /^[^a-z\d\W_CIKMOV]$/) {
       $match .= chr($_); next;
   }
   $match .= ' ';
}
$match =~ s/^ +//;
$match =~ s/ +$//;
print "'$match'\n";

The output shows the gaps in A-Z where C, I, K, M, O, V were removed, for the tested character codes 0-255:
'AB DEFGH J L N PQRSTU WXYZ'
'AB DEFGH J L N PQRSTU WXYZ'
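Since the question is about PHP, here is a rough port of the same check to preg_match. It is only a sketch: the helper name matchedBytes is hypothetical, and it assumes the default "C" locale and no /u modifier, so PCRE classifies single bytes with its built-in tables.

```php
<?php
// Collect every single-byte character (codes 0-255) that the class accepts.
// Hypothetical helper; $class is a complete bracket expression.
function matchedBytes(string $class): string
{
    $out = '';
    for ($i = 0; $i <= 255; $i++) {
        $ch = chr($i);
        if (preg_match('/^' . $class . '$/', $ch) === 1) {
            $out .= $ch;
        }
    }
    return $out;
}

// Negated POSIX class inside a negated class, as described above.
echo matchedBytes('[^[:^alpha:][:lower:]CIKMOV]'), "\n";
// The \d \W _ variant from the second Perl loop.
echo matchedBytes('[^a-z\d\W_CIKMOV]'), "\n";
```

Under those assumptions both lines should print the same twenty letters, ABDEFGHJLNPQRSTUWXYZ, without the padding spaces the Perl script adds for non-matching characters.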

@SilentGhost Internally in perl everything is a byte string, encode to go out, decode to come in. codepoints are as usual otherwise no regex. — , Jan 24 '11 at 21:49
@Silent, Yeah, if it's not in the range of predetermined classes and there is no subtraction class, then another alternative is needed. This just happens to be in that range. — , Jan 24 '11 at 22:02
This is awesome. It achieves the same as character class subtraction. — Quinn Comendant, Aug 10 '15 at 02:37

score 4 · Answer 3 · answered Jan 24 '11 at 16:04

PCRE does not support char class subtraction.

So you can enumerate all the uppercase letters except CIKMOV:

^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABDEFGHJLNPQRSTUWXYZ]{2}$

which can be shortened using ranges to:

^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}$
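Hand-written ranges are easy to get wrong by one letter, so it may be worth checking a class against the intended set before relying on it. A small PHP sketch, where lettersAccepted is a hypothetical helper and the expected set is built directly from "A-Z minus CIKMOV":

```php
<?php
// Letters the postcode rule allows: A-Z with C, I, K, M, O, V removed.
$expected = implode('', array_diff(range('A', 'Z'), str_split('CIKMOV')));

// Hypothetical helper: the uppercase letters a bracket expression accepts.
function lettersAccepted(string $class): string
{
    $accepted = '';
    foreach (range('A', 'Z') as $letter) {
        if (preg_match('/^' . $class . '$/', $letter) === 1) {
            $accepted .= $letter;
        }
    }
    return $accepted;
}

var_dump(lettersAccepted('[ABDEFGHJLNPQRSTUWXYZ]') === $expected); // bool(true)
var_dump(lettersAccepted('[ABD-HJLNP-UW-Z]') === $expected);       // bool(true)
var_dump(lettersAccepted('[ABD-JLNP-UW-Z]') === $expected);        // bool(false): D-J lets I through
```

The last call shows the kind of slip this catches: a range that runs from D to J instead of D to H quietly re-admits the letter I.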

score 1 · Answer 4 · answered Jan 24 '11 at 16:04


I think you're going to have to replace [A-Z-[CIKMOV]] with [ABD-HJLNP-UW-Z]. I don't think PHP supports character class subtraction. My alternative reads something like "A, B, D to H, J, L, N, P to U, and W to Z".
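To see why the original pattern fails rather than erroring out, it helps to look at how PCRE (as far as I can tell, with its default options) reads it: `[A-Z-[CIKMOV]` is an ordinary class containing A-Z, a literal hyphen, a literal `[` and the six letters, and the trailing `]{2}` is a literal `]` repeated twice. A sketch that should make this visible; the `]]` test string is contrived purely for the demonstration:

```php
<?php
// The pattern from the question, unchanged.
$original = '/^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z-[CIKMOV]]{2}$/';
// The same pattern with the subtraction spelled out as ranges.
$rewritten = '/^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}$/';

var_dump(preg_match($original, 'OL13 0EF'));  // int(0): the pattern wants "]]" at the end
var_dump(preg_match($original, 'OL13 0E]]')); // int(1): one class character, then two "]"
var_dump(preg_match($rewritten, 'OL13 0EF')); // int(1)
var_dump(preg_match($rewritten, 'OL13 0EC')); // int(0): C is excluded
```

So the subtraction syntax is not rejected, it is just silently interpreted as something else, which is why the postcode stops matching without any warning.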

answered Jan 24 '11 at 16:04

cambraca

