13

I have tried several regexes, and some valid UK postal codes still get rejected.

Searching the internet, Wikipedia and SO, I could only find regex validation solutions.

Is there a validation method that does not use a regex? Any language will do; I expect it would be easy to port.

I suppose the easiest would be to compare against a postal code database, yet that would need to be maintained and updated periodically from a reliable source.

Edit: To help future visitors and keep you from posting any more regexes, here's a regex which I have tested (as of 2013-04-24) to work for all postal codes in Code Point (see @Mikkel Løkke's answer):

// PHP PCRE (it was on Wikipedia, it isn't there anymore; I might have modified it, don't remember).
// Whitespace is stripped first, so the pattern itself contains no spaces.
$strPostalCode=preg_replace("/\s+/", "", $strPostalCode);
// GIR0AA sits inside the outer group so that ^ and $ anchor both alternatives.
$bValid=preg_match("/^(GIR0AA|((A[BL]|B[ABDHLNRSTX]?|C[ABFHMORTVW]|D[ADEGHLNTY]|E[HNX]?|F[KY]|G[LUY]?|H[ADGPRSUX]|I[GMPV]|JE|K[ATWY]|L[ADELNSU]?|M[EKL]?|N[EGNPRW]?|O[LX]|P[AEHLOR]|R[GHM]|S[AEGKLMNOPRSTY]?|T[ADFNQRSW]|UB|W[ADFNRSV]|YO|ZE)[1-9]?[0-9]|((E|N|NW|SE|SW|W)1|EC[1-4]|WC[12])[A-HJKMNPR-Y]|(SW|W)([2-9]|[1-9][0-9])|EC[1-9][0-9])[0-9][ABD-HJLNP-UW-Z]{2})$/i", $strPostalCode)===1;
oxygen
  • Why do you care whether it uses a regex or not? – Philip Kendall Apr 11 '13 at 22:56
  • Regexes are hard to debug, hard to port from one regex flavor to another (silent "errors"), and hard to update. Of all the postal code regexes out there, the UK ones are the most complicated. While I am using regexes for every other country (except for two countries for which I can match states/provinces with the postal code), for the UK I would like something more solid and much easier to fix when something doesn't work. – oxygen Apr 11 '13 at 23:03
  • Updating from Code Point periodically is not what I have in mind (it has to be done often, so as not to reject valid newly assigned postcodes). A more permissive general rule is better suited to my particular needs. While the above-mentioned regex accomplishes this, it is not easy to update or port. Several answers proposed deriving the rules back from the regex, or understanding those Wikipedia-style rules. I am starting to think it would be better to start from the data provided by Code Point (see Mikkel Løkke's answer); besides the postal codes, Code Point explains the area codes and such. – oxygen Apr 25 '13 at 10:07
  • Ever considered posting a CURL request to http://www.royalmail.com/postcode-finder/? – Daryl Gill Apr 30 '13 at 16:03

8 Answers

22

I'm writing this answer based on the wiki page.

Looking at the validation section, there seem to be six formats (A = letter and 9 = digit):

AA9A 9AA                       AA9A9AA                   AA9A9AA
A9A 9AA     Removing space     A9A9AA       order it     AA999AA
A9 9AA    ------------------>  A99AA     ------------->  AA99AA
A99 9AA                        A999AA                    A9A9AA
AA9 9AA                        AA99AA                    A999AA
AA99 9AA                       AA999AA                   A99AA

As we can see, the length may vary from 5 to 7, and we have to take some special cases into account if we want to.

So the function we are coding has to do the following:

  1. Remove spaces and convert to uppercase (or lower case).
  2. Check if the input is an exception; if it is, return valid.
  3. Check if the input's length is 4 < length < 8.
  4. Check if it's a valid postcode.

The last part is tricky, but we will split it into three sections by length for a better overview:

  1. Length = 7: AA9A9AA and AA999AA
  2. Length = 6: AA99AA, A9A9AA and A999AA
  3. Length = 5: A99AA

For this we will be using a switch(). From now on it's just a matter of checking, character by character, whether there is a letter or a number in the right place.

So let's take a look at our PHP implementation:

function check_uk_postcode($string){
    // Start config
    $valid_return_value = 'valid';
    $invalid_return_value = 'invalid';
    $exceptions = array('BS981TL', 'BX11LT', 'BX21LB', 'BX32BB', 'BX55AT', 'CF101BH', 'CF991NA', 'DE993GG', 'DH981BT', 'DH991NS', 'E161XL', 'E202AQ', 'E202BB', 'E202ST', 'E203BS', 'E203EL', 'E203ET', 'E203HB', 'E203HY', 'E981SN', 'E981ST', 'E981TT', 'EC2N2DB', 'EC4Y0HQ', 'EH991SP', 'G581SB', 'GIR0AA', 'IV212LR', 'L304GB', 'LS981FD', 'N19GU', 'N811ER', 'NG801EH', 'NG801LH', 'NG801RH', 'NG801TH', 'SE18UJ', 'SN381NW', 'SW1A0AA', 'SW1A0PW', 'SW1A1AA', 'SW1A2AA', 'SW1P3EU', 'SW1W0DT', 'TW89GS', 'W1A1AA', 'W1D4FA', 'W1N4DJ');
    // Add Overseas territories ?
    array_push($exceptions, 'AI-2640', 'ASCN1ZZ', 'STHL1ZZ', 'TDCU1ZZ', 'BBND1ZZ', 'BIQQ1ZZ', 'FIQQ1ZZ', 'GX111AA', 'PCRN1ZZ', 'SIQQ1ZZ', 'TKCA1ZZ');
    // End config


    $string = strtoupper(preg_replace('/\s/', '', $string)); // Remove the spaces and convert to uppercase.
    $exceptions = array_flip($exceptions);
    if(isset($exceptions[$string])){return $valid_return_value;} // Check for valid exception
    $length = strlen($string);
    if($length < 5 || $length > 7){return $invalid_return_value;} // Check for invalid length
    $letters = array_flip(range('A', 'Z')); // An array of letters as keys
    $numbers = array_flip(range(0, 9)); // An array of numbers as keys

    switch($length){
        case 7:
            if(!isset($letters[$string[0]], $letters[$string[1]], $numbers[$string[2]], $numbers[$string[4]], $letters[$string[5]], $letters[$string[6]])){break;}
            if(isset($letters[$string[3]]) || isset($numbers[$string[3]])){
                return $valid_return_value;
            }
        break;
        case 6:
            if(!isset($letters[$string[0]], $numbers[$string[3]], $letters[$string[4]], $letters[$string[5]])){break;}
            if(isset($letters[$string[1]], $numbers[$string[2]]) || isset($numbers[$string[1]], $letters[$string[2]]) || isset($numbers[$string[1]], $numbers[$string[2]])){
                return $valid_return_value;
            }
        break;
        case 5:
            if(isset($letters[$string[0]], $numbers[$string[1]], $numbers[$string[2]], $letters[$string[3]], $letters[$string[4]])){
                return $valid_return_value;
            }
        break;
    }

    return $invalid_return_value;
}

Note that I've not added British Forces Post Office and non-geographic codes.

Usage:

echo check_uk_postcode('AE3A 6AR').'<br>'; // valid
echo check_uk_postcode('Z9 9BA').'<br>'; // valid
echo check_uk_postcode('AE3A6AR').'<br>'; // valid
echo check_uk_postcode('EE34      6FR').'<br>'; // valid
echo check_uk_postcode('A23A 7AR').'<br>'; // invalid
echo check_uk_postcode('A23A   7AR').'<br>'; // invalid
echo check_uk_postcode('WA3334E').'<br>'; // invalid
echo check_uk_postcode('A2 AAR').'<br>'; // invalid
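
The same four steps can also be sketched more compactly by reducing the input to its letter/digit "shape" and comparing that against the six formats. This is a Python sketch rather than a line-by-line port of the PHP above; the names `shape` and `check_uk_postcode_shape` are made up for illustration, and the exception set is deliberately abbreviated to three entries from the list above.

```python
# Six formats from the table above, spaces removed (A = letter, 9 = digit).
VALID_SHAPES = {"AA9A9AA", "AA999AA", "AA99AA", "A9A9AA", "A999AA", "A99AA"}

# Abbreviated on purpose: extend with the full exception list as needed.
EXCEPTIONS = {"GIR0AA", "ASCN1ZZ", "AI-2640"}

ASCII_LETTERS = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
ASCII_DIGITS = set("0123456789")


def shape(code: str) -> str:
    """Map each letter to 'A', each digit to '9', anything else to '?'."""
    return "".join(
        "A" if ch in ASCII_LETTERS else "9" if ch in ASCII_DIGITS else "?"
        for ch in code
    )


def check_uk_postcode_shape(raw: str) -> bool:
    code = "".join(raw.split()).upper()   # step 1: strip whitespace, uppercase
    if code in EXCEPTIONS:                # step 2: known special cases
        return True
    if not 5 <= len(code) <= 7:           # step 3: length check
        return False
    return shape(code) in VALID_SHAPES    # step 4: format check
```

Adding or removing a format then means editing one entry in `VALID_SHAPES` instead of a branch of the switch.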
HamZa
6

As supplied by the UK government.

   (GIR 0AA)|((([A-Z-[QVX]][0-9][0-9]?)|(([A-Z-[QVX]][A-Z-[IJZ]][0-9][0-9]?)|(([A-Z-[QVX]][0-9][A-HJKSTUW])|([A-Z-[QVX]][A-Z-[IJZ]][0-9][ABEHMNPRVWXY])))) [0-9][A-Z-[CIKMOV]]{2})
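
The `[A-Z-[QVX]]` notation is character-class subtraction (A to Z minus Q, V and X). As far as I know that syntax comes from .NET and XML Schema, and PCRE and Python's `re` do not read it that way, so the pattern has to be translated before it can be used elsewhere. Below is a hand translation into plain character classes, sketched in Python with the space made optional. `GOV_PATTERN` and `matches_gov_pattern` are illustrative names, and the translation has not been checked against Code Point.

```python
import re

FIRST = "[A-PR-UWYZ]"         # A-Z minus Q, V, X
SECOND = "[A-HK-Y]"           # A-Z minus I, J, Z
INWARD = "[ABD-HJLNP-UW-Z]"   # A-Z minus C, I, K, M, O, V

GOV_PATTERN = re.compile(
    "GIR ?0AA|"
    "(?:"
    f"{FIRST}[0-9][0-9]?"                     # A9, A99
    f"|{FIRST}{SECOND}[0-9][0-9]?"            # AA9, AA99
    f"|{FIRST}[0-9][A-HJKSTUW]"               # A9A
    f"|{FIRST}{SECOND}[0-9][ABEHMNPRVWXY]"    # AA9A
    ")"
    f" ?[0-9]{INWARD}{{2}}"                   # optional space, then 9AA
)


def matches_gov_pattern(raw: str) -> bool:
    # fullmatch anchors the whole pattern, including both top-level alternatives.
    return GOV_PATTERN.fullmatch(raw.strip().upper()) is not None
```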

I've built London-only, postcode-based apps using the postcodes I got from HERE. But to be honest, even with London postcodes only, you need a lot more storage than you might expect. Sure, the idea is trivial: store the postcodes, take the user input or whatever, and see if you get a match.

But you are complicating the solution far more than you think. I HAD to use actual postcodes to achieve what I wanted, but for simple validation purposes, as hard as "maintaining" a regex is, storing tens or hundreds of thousands of postcodes (if not more) and validating against them more or less in real time is a far more difficult task.

If a mini distributed service sounds like a more efficient solution than a regex, go for it, but I'm sure it isn't. Unless you need geo-spatial querying of your own data against UK postcodes or things like that, I doubt DB storage is a feasible solution. Just my 2 cents.

Update

According to this index, there are 1,758,417 postcodes in the UK. I can tell you I am using a few Mongo clusters (Amazon EC2 High Memory Instances) to provide reliable London-only services (indexing only London postcodes), and it's quite a pricey thing, even with basic storage.

Admittedly, the app is performing medium complexity geo-spatial queries, but the storage requirements alone are very expensive and demanding.

Bottom line, just stick to regex and be done with it in two minutes.

flavian
  • This is the best solution imho, let someone else more qualified do the work for you! If you get failures reported on your system (false negatives), it would usually be easy to see why they don't fit the standard model (perhaps British Forces or overseas territories) although I would expect the government regex to be pretty close to complete. – Lukos Apr 22 '13 at 15:52
  • @alex23 Please add a URL for where you got the regex; it might come in handy in the future for other people. – oxygen Apr 22 '13 at 16:05
  • This regex fails for the very first postcode in Code Point, AB565TR – oxygen Apr 24 '13 at 10:36
  • One obvious problem would be the required space character about here `...)))) [0-9][A-Z-[CIKMOV]]{2})`. I don't know about others. – oxygen Apr 26 '13 at 07:40
2

I'm looking at the "Postcodes in the United Kingdom" article on Wikipedia right now.

http://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom

The Validation section lists six formats with a combination of letters and numbers. Then there's more information in the notes below that. The first thing that I would try is a BNF type grammar with a tool like GoldParserBuilder. You could describe the basic formats in a more readable format, with efficient parser and lexer automatically generated. In the past, I've successfully used such tools to avoid writing huge, ugly regexes.

From that point, the program has a properly formatted postcode of a known type. At this point, the specific numbers or letters might still violate something. Each type of postcode can have a function programmed to look for violations of that specific type. The final product will consist of an automatically generated parser that passes unvalidated, but structured/identified, postcodes to a dedicated validation function. You can then refactor or optimize from there.

(You can also use the grammar itself to enforce or disallow certain literals and combinations. Whatever is more readable or comprehensible for you. Different people gravitate toward different ends of these things.)
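
To make the split between "identify the format" and "validate that format" concrete, here is a small Python sketch. It is not GOLD output: `identify` is a hand-written stand-in for the generated parser, all names are illustrative, and the letter restrictions are taken from the government pattern quoted in another answer on this page, so treat them as assumptions to verify.

```python
from typing import Callable, Dict, Optional, Tuple


def identify(raw: str) -> Optional[Tuple[str, str, str]]:
    """Stand-in for the generated parser: return (format, outward, inward)."""
    code = "".join(raw.split()).upper()
    if not (5 <= len(code) <= 7 and code.isascii() and code.isalnum()):
        return None
    outward, inward = code[:-3], code[-3:]
    fmt = "".join("A" if ch.isalpha() else "9" for ch in outward)
    return fmt, outward, inward


def inward_ok(inward: str) -> bool:
    # A digit plus two letters, none of them C, I, K, M, O or V.
    return (inward[0].isdigit() and inward[1:].isalpha()
            and not set(inward[1:]) & set("CIKMOV"))


def first_letter_ok(outward: str) -> bool:
    return outward[0] not in "QVX"


def two_letter_area_ok(outward: str) -> bool:
    return first_letter_ok(outward) and outward[1] not in "IJZ"


# One dedicated validator per identified outward format.
VALIDATORS: Dict[str, Callable[[str], bool]] = {
    "A9": first_letter_ok,
    "A99": first_letter_ok,
    "AA9": two_letter_area_ok,
    "AA99": two_letter_area_ok,
    "A9A": lambda o: first_letter_ok(o) and o[2] in "ABCDEFGHJKSTUW",
    "AA9A": lambda o: two_letter_area_ok(o) and o[3] in "ABEHMNPRVWXY",
}


def validate(raw: str) -> bool:
    parsed = identify(raw)
    if parsed is None:
        return False
    fmt, outward, inward = parsed
    check = VALIDATORS.get(fmt)
    return check is not None and inward_ok(inward) and check(outward)
```

Each rule lives in its own small function, so a rejected postcode can be traced to the single check that failed.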

Here's a page highlighting the advantages of the GOLD Parsing System. You can use any tool you like; I just promote this one because it's good at its job and has steadily improved over many years. http://www.goldparser.org/about/why-use-gold.htm

Nick P
2

I would think the regex, while long-winded, would probably be the best solution if all you want to do is check whether something could be a valid UK postcode.

If you need absolute data, consider using the Ordnance Survey OpenData initiative's "Code-Point® Open" dataset, which is a CSV of lots of data points in Great Britain (so not Northern Ireland, I'm guessing), one of which is the postcode. Be aware that the file is 20MB, so you may have to convert it to a more manageable format.
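
One "more manageable format" is simply an in-memory set of normalised postcodes. The Python sketch below assumes the postcode is in the first CSV column and that there is no header row, which should be checked against the actual download; the function names and the file name in the usage comment are hypothetical.

```python
import csv
from typing import Iterable, Set


def normalise(postcode: str) -> str:
    """Strip all whitespace and uppercase, so 'ab56 5tr' equals 'AB565TR'."""
    return "".join(postcode.split()).upper()


def load_postcodes(lines: Iterable[str], column: int = 0) -> Set[str]:
    """Collect one CSV column (assumed to hold the postcode) into a set."""
    return {normalise(row[column]) for row in csv.reader(lines) if row}


def is_known_postcode(raw: str, known: Set[str]) -> bool:
    return normalise(raw) in known


# Usage (hypothetical file name):
# with open("codepoint_ab.csv", newline="") as fh:
#     known = load_postcodes(fh)
# is_known_postcode("AB56 5TR", known)
```

Set membership is a single hash lookup, so validation itself is cheap once the data is loaded; the cost is keeping the file up to date.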

Mikkel Løkke
2

"Regexes are hard to debug, hard to port from one regex flavor to another (silent "errors"), and hard to update."

That is true for most regexes, but why don't you just split it up into multiple parts? You can easily split it into six parts for the six different general rules and maybe even more if you take all of the special cases into account.

Creating a well-commented method of 20 lines with simple regexes is easy to debug (one simple regex per line) and also easy to update. The porting problem is the same, but on the other hand you do not need to use some fancy grammar lib.
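
A sketch of what that could look like in Python, with one small pattern per general format (A = letter, 9 = digit). The names `FORMATS`, `SPECIAL_CASES` and `is_valid_format` are illustrative, only the six general formats are covered, and the per-position letter restrictions are left out.

```python
import re

# One simple pattern per general format, written for input without spaces.
FORMATS = [
    re.compile(r"[A-Z]{2}[0-9][A-Z][0-9][A-Z]{2}"),  # AA9A 9AA
    re.compile(r"[A-Z][0-9][A-Z][0-9][A-Z]{2}"),     # A9A 9AA
    re.compile(r"[A-Z][0-9][0-9][A-Z]{2}"),          # A9 9AA
    re.compile(r"[A-Z][0-9]{2}[0-9][A-Z]{2}"),       # A99 9AA
    re.compile(r"[A-Z]{2}[0-9][0-9][A-Z]{2}"),       # AA9 9AA
    re.compile(r"[A-Z]{2}[0-9]{2}[0-9][A-Z]{2}"),    # AA99 9AA
]

# Special cases are plain strings, so they need no regex at all.
SPECIAL_CASES = {"GIR0AA"}


def is_valid_format(raw: str) -> bool:
    code = "".join(raw.split()).upper()
    if code in SPECIAL_CASES:
        return True
    # fullmatch anchors each pattern at both ends.
    return any(pattern.fullmatch(code) for pattern in FORMATS)
```

When a valid postcode is rejected, each pattern can be tried on its own to see which rule is too strict.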

TheBrain
1

Are third party services an option?

http://www.postcodeanywhere.co.uk/address-validation/

GeoNames Database:

http://www.geonames.org/postal-codes/

Squiggs.
  • I think the GeoNames project is free, and web services are available: http://www.geonames.org/export/web-services.html#postalCodeSearch – Squiggs. Apr 22 '13 at 16:04
1

+1 for the "why care" comments. I have had to use the 'official' regex in various projects and while I have never attempted to break it down, it works and it does the job. I've used it with Java and PHP code without any need to convert it between regex formats.

Is there a reason why you would have to debug it or break it down?

Incidentally, the regex rule used to be found on Wikipedia, but it appears to have gone.

Edit: As for the space/no-space debate, the postcode should be valid with or without the space. As the last part of the postcode (after the space) is ALWAYS three characters (a digit followed by two letters), it is possible to insert the space manually, which will then allow you to run it through the regex rule.
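
Re-inserting the space could look like this in Python. `insert_space` is an illustrative name, and the function only reformats the input; it does not validate anything.

```python
def insert_space(raw: str) -> str:
    """Uppercase the input and put one space before its last three characters."""
    code = "".join(raw.split()).upper()
    if len(code) < 4:
        # Too short to have both an outward and an inward part; leave as is.
        return code
    return f"{code[:-3]} {code[-3:]}"
```

The result can then be passed to a regex that requires the space.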

Phil Kingston
  • Before discovering Code Point (which I have recently used to validate the regex used for validation), it was rather hard to detect false positives. – oxygen Apr 29 '13 at 12:41
0

Take the list of valid postcodes and check if the one entered is in it.

flup
  • UK postcodes are combinations of letters and numbers. Building and maintaining the list would mean handling at least tens of thousands of entries, so this is not really a good suggestion in this case. – Lukos Apr 22 '13 at 15:50