4

I have a problem like this at work:

Column Code has the value like, 1000, 1200, A1000, B1200, AAA, BBB, etc. Currently it is separated by spaces, sometimes more than one due to poor data input. And I am trying to check if a record contain a code that I am interested in.

Interested_Code: 1000 or A1000 or 444 or 555 or A555 etc.

I know a simple solution from this answer:

A.CODE LIKE CAT('% ', T3.Interested_Code, ' %')

I have appended a leading and trailing space to A.CODE to ensure a "full" exact match are returned. Because if I simply do

A.CODE LIKE CAT('%', T3.Interested_Code, '%') or
A.CODE CONTAINS T3.Interested_Code

I will get a false positive for code = 1000 at a row contained code = A1000, this matches part of the code, but not necessary a correct result.

My code works above, but it is doing too many test and really slow. Is there a faster or smarter way in PROC SQL? The main table is about 100k rows, and each row has around 10-20 codes. The interested code is about 8k values. Thanks.

George
  • 4,514
  • 17
  • 54
  • 81
  • 10-20 code variables (`code1` through `code20`), or 10-20 codes in a single column named `code` ? Do the column values contain any other information beside 1 code value that may or may not contain an embedded space ? – Richard Apr 30 '18 at 15:25
  • @Richard 10-20 code in a single column named 'code' at my main table. Each code is separated by 1 or more spaces. – George Apr 30 '18 at 15:36

3 Answers3

3

You could use FINDW or INDEXW, which find "words" (by default, things separated by spaces or similar). That is probably better than your solution, in particular because you won't find

"1000 "

since it doesn't start with a space, the way you are doing it.

proc sql;
  create table final_codes as
  select codes.*
  from codes where exists (
    select 1 from interested_codes
    where findw(codes.code,trim(interested_codes.code)) > 0)
  ;
quit;

However, this is effectively a cartesian join, and very slow. It has to join all possible combinations - 8000 times 100,000, or effectively 800 million temporary rows before it subsets down. It's just not going to be all that fast no matter what you do.

Doing this in a data step would be more efficient, in particular as you can more easily stop once you find a match. You can put the interested_codes table into a hash table or a temporary array, and then depending on your match frequency it may be faster to search each code in the interested_codes table, or the reverse, but either way stop when you find a match (instead of doing all possible combinations).

Joe
  • 62,789
  • 6
  • 49
  • 67
2

Try to use regular expression:

data want;
   set have;
   where prxmatch('/^1000$/',strip(code));
run;
Shenglin Chen
  • 4,504
  • 11
  • 11
2

There are two main problems

  1. The code data may contain a space between letter and digits
  2. The search space is over several target codes

The features of a solution will be

  1. normalize the code data
  2. search for any of the target codes
    2a. flag a record as matches any
    2b. elucidates which targets matched (a binary variable for each target code, or a result row for each matched target)

You would have to benchmark your installation for a performance comparison between various approaches.

This sample code is a mockup for 2a. A macro builds out a SQL case to tag an 'any match' condition. The resultant query is expensive as it requires a normalizing regex for each row, and all case conditions must fail for a row to be excluded from the result set.

data have;
  code = 'A    1000 1111 C333 555 A111 Z  999 B 222'; output;
  code = 'ZZZZZ 1121'; output;
  code = 'A    1000'; output;
  code = 'AB1000'; output;
run;

%macro withAnyOf (data=have, out=want, targets=);

  %local i qTarget N;
  %let N = %sysfunc(countw(&targets,%str( )));

%put NOTE: &=N;

  %do i = 1 %to &N;
    %local clause&i;

    %let qTarget = %sysfunc(quote(%qscan(&targets,&i,%str( ))));
    %let clause&i = when indexw (calculated codeCleaned, &qTarget) then 1;

    %put NOTE: &&clause&i;
  %end;

  proc sql;
    create table &out(drop=codeCleaned) as
    select 
      *
    , ' ' || prxchange('s/([A-Z]) +/$1/',-1,code) || ' ' as codeCleaned
    from &data
    where
      case
        %do i = 1 %to &N;
        &&clause&i
        %end;        
        else 0
      end
    ;
  quit;

%mend;

options mprint;

%withAnyOf (targets=1000 A1000 444 555 A555)
Richard
  • 25,390
  • 3
  • 25
  • 38