
I'm trying to use prxmatch to verify whether a UK postcode is in the correct format. The regex ('/^[A-Z]{1,2}\d{2,3}[A-Z]{2}|[A-Z]{1,2}\d[A-Z]\d[A-Z]{2}$/') covers (I think) all the postcode formats used in the UK; however, I only want exact matches, not partial ones, with no additional characters before or after the match.

data pc_flag;
    set abc;

    format pc_correct_flag $1. compressed_postcode $100.;
    compressed_postcode = compress(postcode);

    pc_regex = prxparse('/^[A-Z]{1,2}\d{2,3}[A-Z]{2}|[A-Z]{1,2}\d[A-Z]\d[A-Z]{2}$/');

    if prxmatch(pc_regex, compressed_postcode) > 0
        then pc_correct_flag = 'Y';
        else pc_correct_flag = 'N';
run;

I was expecting 'Y' only for exact matches on the full string, i.e. with no additional characters before or after the regex match. However, I'm also getting false positives where part of 'compressed_postcode' matches the regex but there are additional characters after the match, which I thought using $ would prevent. For example, I'd expect only something like AA11AA to match, but not AA11AAAA. I suspect this has to do with the positioning of $ but can't figure out exactly what's wrong. Any idea what I've missed?

  • Please post a few examples of the strings you're trying to match - the successful matches and the unsuccessful matches, if possible. We need a bit more info to help! – Nick Reed Nov 01 '19 at 14:33
  • I think you should use a non capturing group for the alternation `^(?:[A-Z]{1,2}\d{2,3}[A-Z]{2}|[A-Z]{1,2}\d[A-Z]\d[A-Z]{2})$` See https://regex101.com/r/zHfwB5/1 – The fourth bird Nov 01 '19 at 14:33
  • The below comes back as successful match: BS161BS OL162BX WF177JN LS285LYJ MK464BS` NR339QT8D TN56RTN6 However, I only want the top 3 (BS161BS OL162BX WF177JN) returned as the bottom 4 have extra characters after match – SAS_newbie Nov 01 '19 at 14:57
  • @The fourth bird: I see, however this doesn't work for some reason, all results are now 'N', ie nothing matches - used: pc_regex = prxparse('/^(?:[A-Z]{1,2}\d{2,3}[A-Z]{2}|[A-Z]{1,2}\d[A-Z]\d[A-Z]{2})$/') – SAS_newbie Nov 01 '19 at 15:18
  • @SAS_newbie In this demo it does not match all those values https://regex101.com/r/p6I4R4/1 – The fourth bird Nov 01 '19 at 16:27
  • @The fourth bird: I wonder if it's an issue with SAS not recognising non-capturing groups as I used exactly the code you linked and got zero matches – SAS_newbie Nov 01 '19 at 16:48
  • I found [this page](https://documentation.sas.com/?docsetId=lefunctionsref&docsetTarget=p1vz3ljudbd756n19502acxazevk.htm&docsetVersion=9.4&locale=en) for Perl Regular Expression and that supports non capturing groups. You could try it with a capturing group `()` instead of `(?:)` but I don't think that will help. – The fourth bird Nov 01 '19 at 16:53
  • Make sure to TRIM() the trailing spaces from the value of the variable. SAS stores character strings as fixed length. And if `compressed_postcode` was not in `abc` then a side effect of attaching the `$100.` format to it will be to define it with a length of 100. – Tom Nov 01 '19 at 17:28
  • Why not just test the two patterns independently instead of trying to figure out how to use the `|` ? – Tom Nov 01 '19 at 17:33
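
Tom's suggestion of testing the two patterns independently could look roughly like the sketch below. It reuses the dataset and variable names from the question; `re_a` and `re_b` are hypothetical names introduced here for illustration. Each pattern carries its own `^` and `$`, so the alternation is no longer needed, and the value is trimmed before matching.

```sas
data pc_flag;
    length pc_correct_flag $1 compressed_postcode $100;
    retain re_a re_b;   /* hypothetical names: keep the compiled patterns across rows */
    set abc;

    /* Compile each pattern once, each fully anchored on its own. */
    if _n_ = 1 then do;
        re_a = prxparse('/^[A-Z]{1,2}\d{2,3}[A-Z]{2}$/');
        re_b = prxparse('/^[A-Z]{1,2}\d[A-Z]\d[A-Z]{2}$/');
    end;

    compressed_postcode = compress(postcode);

    /* TRIM() removes the padding blanks so that $ can match at the end. */
    if prxmatch(re_a, trim(compressed_postcode)) > 0
       or prxmatch(re_b, trim(compressed_postcode)) > 0
        then pc_correct_flag = 'Y';
        else pc_correct_flag = 'N';

    drop re_a re_b;
run;
```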

2 Answers


SAS character variables contain trailing spaces out to the length of the variable. Either trim the value to be examined, or add \s*$ as the pattern termination.

if prxmatch(pc_regex,TRIM(compressed_postcode))>0 then … 
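
Putting the thread together, two things appear to be in play. In Perl regex syntax `|` has lower precedence than the anchors, so `^A|B$` is read as `(^A)` or `(B$)`, and the first alternative is never anchored at the end. Once the alternation is grouped, `$` applies to both alternatives, but it then fails against the trailing blanks unless the value is trimmed, which would explain why grouping alone returned 'N' for everything. A minimal self-contained sketch, assuming the sample values from the comments are read in with DATALINES (the dataset name `pc_check` is made up for this example):

```sas
data pc_check;
    length postcode $100 pc_correct_flag $1;
    retain pc_regex;
    input postcode $;

    /* Group the alternation so ^ and $ apply to both alternatives. */
    if _n_ = 1 then
        pc_regex = prxparse('/^(?:[A-Z]{1,2}\d{2,3}[A-Z]{2}|[A-Z]{1,2}\d[A-Z]\d[A-Z]{2})$/');

    /* TRIM() strips the blanks that pad the value out to length 100. */
    if prxmatch(pc_regex, trim(postcode)) > 0
        then pc_correct_flag = 'Y';
        else pc_correct_flag = 'N';

    drop pc_regex;
    datalines;
BS161BS
OL162BX
WF177JN
LS285LYJ
MK464BS`
NR339QT8D
TN56RTN6
;
run;
```

With these values, the first three rows should be flagged 'Y' and the remaining four 'N', since each of the latter has at least one extra character after a valid-looking prefix.
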
Richard

Your regex is quite permissive - it allows every letter of the alphabet in every valid character position, so it matches quite a lot of strings that look like valid postcodes but do not exist as such, e.g. ZZ1 1ZZ.

I provided a more specific SAS-compatible postcode regex as an answer to another question. Here's a link in case it proves useful to you: https://stackoverflow.com/a/43793562/667489

That one still matches some non-postcode strings, but it filters out any with characters on Royal Mail's blacklists for each position within the postcode.

As per Richard's answer, you need to trim the string being matched before applying the regex, or amend the regex to match extra trailing blanks.
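
As a rough illustration of the second option, the sketch below keeps the question's permissive pattern but ends it with `\s*$`, so the padding blanks are absorbed by the regex and no TRIM() is needed. The alternation is also wrapped in a non-capturing group (as suggested in the comments) so that the end anchor applies to both alternatives. Dataset and variable names are taken from the question.

```sas
data pc_flag;
    length pc_correct_flag $1 compressed_postcode $100;
    retain pc_regex;
    set abc;

    /* \s*$ allows any number of trailing blanks before the end of the value. */
    if _n_ = 1 then
        pc_regex = prxparse('/^(?:[A-Z]{1,2}\d{2,3}[A-Z]{2}|[A-Z]{1,2}\d[A-Z]\d[A-Z]{2})\s*$/');

    compressed_postcode = compress(postcode);

    if prxmatch(pc_regex, compressed_postcode) > 0
        then pc_correct_flag = 'Y';
        else pc_correct_flag = 'N';

    drop pc_regex;
run;
```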

user667489