2

I have a question about the following 2 codes in SAS PROC SQL.

Code 1: (Standard Book version)

CREATE TABLE WORK.OUTPUT AS 
SELECT 
    "CLAIM" AS SOURCE,
    a.CLAIMID, 
    a.DXCODE
FROM 
    DW.CLAIMS_BAV AS a
WHERE 
    a.SITEID = '0001'
    AND a.CLAIMID IN (SELECT CLAIMID FROM WORK.INPUT)

Code 2: (The much faster way in practice)

CREATE TABLE WORK.OUTPUT AS 
SELECT 
    "CLAIM" AS SOURCE,
    a.CLAIMID, 
    a.DXCODE
FROM 
    DW.CLAIMS_BAV AS a
WHERE 
    a.SITEID = '0001'
    AND a.CLAIMID IN ('10001', '10002', '10003', ... '15000')

When I try to do it more elegantly by using subquery in #1, the run time blows up to 50 minutes +. But the same input returns within 3 minutes using Code 2. Why is that? Note, it's just as slow using INNER JOIN too (after reading this). The input is 5000+ CLAIMID, which I manually paste into the IN('...') block everyday.

PS: The CLAIMID are made up, in real life they are random.

The CLAIMID are indexed in DW.CLAIMS. I am using SAS PROC SQL to access an Oracle database. What is going on, and is there a better way? Thanks!

Joe
  • 62,789
  • 6
  • 49
  • 67
George
  • 4,514
  • 17
  • 54
  • 81
  • 1) Did you clear cache after first run? 2) Literals provide better plan than subquery (please post execution plans) – Lukasz Szozda Nov 03 '17 at 15:16
  • 1
    Following on lad's comment, when you use literals the database can use tricks like creating a hash with the values for much faster lookup. – Tim Biegeleisen Nov 03 '17 at 15:18
  • @lad2025 In SAS how to get the execution plans? – George Nov 03 '17 at 15:26
  • 1
    Is the DW libref pointing to SAS datasets or some external database? Perhaps you have forced SAS to pull the entire DW into SAS to join with your WORK dataset instead of just pushing some code into the DW. – Tom Nov 03 '17 at 15:26
  • 1
    Removing [tag:sql] and adding [tag:proc-sql] as this isn't really a question about generic SQL but is specific to SAS. – Joe Nov 03 '17 at 15:28
  • @Tom that's probably what is happening, the actual data are housed in oracle. many co-worker suspected sas pull all data in and then filter out for results. – George Nov 03 '17 at 15:30

5 Answers5

4

I don't know that I can tell you why SAS is so slow at the first select; something's not optimized in that scenario clearly.

If I had to guess, I'd guess that SAS is deciding in the first case that it can't use pass-through SQL and so it's downloading the whole big table and then running this SAS-side, while in the second case it's passing the query up to the SQL database and only transporting the resulting rows back.

But there are several ways to work around this, anyway. Here's one: use a macro variable to do precisely the pasting you're doing!

proc sql;
select quote(strip(claimid)) into :claimlist separated by ',' 
from work.input
;

CREATE TABLE WORK.OUTPUT AS 
SELECT 
    "CLAIM" AS SOURCE,
    a.CLAIMID, 
    a.DXCODE
FROM 
    DW.CLAIMS_BAV AS a
WHERE 
    a.SITEID = '0001'
    AND a.CLAIMID IN (&claimlist.)
;
quit;

Tada, you don't have to touch this anymore, and it's identical to the copy/paste that you did.

A few extra notes given some comments:

If CLAIMID is ever less than 15, you may have space padding, so I added strip to remove those. It doesn't matter for string comparisons - except insomuch as you might run out of macro language, and I worry that some DBMS may actually care about the padding. You can leave out strip if the 15 is a constant length.

Macro variables run up to 64K in space. If you have 15 character variable plus " " two plus comma one, you have 18 characters; you have room for a bit over 3500 values. That's under 5000, unfortunately.

In this case, you can either split up the field into two macro variables (easy enough hopefully, use obs and firstobs) or you can do some other solution.

  1. Transfer the work.input dataset into the DW libname, then do the join in SQL there.
  2. Put the contents of the claimID into a file instead of into a macro variable, and then %include that file.
  3. Use call execute to execute the whole proc SQL.

Here's one example of CALL EXECUTE.

data _null_;
  set work.input end=eof;
  if _n_=1 then do;
    call execute('CREATE TABLE WORK.OUTPUT AS 
SELECT 
    "CLAIM" AS SOURCE,
    a.CLAIMID, 
    a.DXCODE
FROM 
    DW.CLAIMS_BAV AS a
WHERE 
    a.SITEID = "0001"
    AND a.CLAIMID IN (');      *the part of the SQL query before the list of IDs;
  end;
  call execute(quote(claimID) || ' ');
  if EOF then do;
    call execute('); QUIT;');  *the part of the SQL query after the list of IDs;
  end;
run;

This would be nearly identical to the %INCLUDE solution really, except there you put that stuff to a text file instead of CALL EXECUTEing it, and then you %INCLUDE that text file.

Joe
  • 62,789
  • 6
  • 49
  • 67
  • Actually Reeza's and mine should be 100% identical, hers has a few minor differences that are irrelevant - but as for who you pick with the checkmark I certainly don't care either way :) – Joe Nov 03 '17 at 15:54
  • Err... Thank you everyone!! I am too new to this. – George Nov 03 '17 at 15:56
  • I didn't mean you should switch it, the choice of that is yours. I just meant our solutions were identical as to how they'd work. They both came in at identical times so it's really up to you! – Joe Nov 03 '17 at 15:57
  • Updated with some more information, particularly if you have trouble with the macro variable length. – Joe Nov 03 '17 at 16:07
  • I love this answer, but I will look into it and create as another question (follow SO's policy) if I have more questions. Thank you!! – George Nov 03 '17 at 16:15
  • You don't want to `strip()` the values, but `trim()` them. The strip function would remove leading spaces and result would be to search for different values. – Tom Mar 07 '18 at 15:39
3

I think you're working both with local data and data on your server. When SAS is working with data from different sources (databases) it brings it all into SAS for processing which can be really, really slow.

Instead, you can make a macro variable and use that within your query. If it's 5000, it should fit into one macro variable, assuming the length is less than 13 chars each. A macro variable size limit is 64K characters, so it depends on the length of the variable. If not you could create a macro instead.

proc sql noprint;
   select quote(claimID, "'") into : claim_list separated by ", " from input;
quit;



proc sql;
CREATE TABLE WORK.OUTPUT AS 
SELECT 
"CLAIM" AS SOURCE,
a.CLAIMID, 
a.DXCODE
FROM  DW.CLAIMS_BAV AS a
WHERE 
a.SITEID = '0001'
AND a.CLAIMID IN (&claim_list.);
quit;
Reeza
  • 20,510
  • 4
  • 21
  • 38
  • 1
    Great minds think alike... SAS will convert the quotes properly when pass-through-ing the query, won't it? (I.e., you probably dont need the `"'"` argument on `quote`... – Joe Nov 03 '17 at 15:27
  • Yes, if SAS is pushing the query into the server for you it will replace the quotes. SAS also does not need the commas in the list of values. You can just use space as the delimiter instead. – Tom Nov 03 '17 at 15:29
  • Thanks Reeza, if the ClaimID is 15 characters long in Text. Would you know how much is the (estimated) limit without using a macro? 65536/15= 4369?... – George Nov 03 '17 at 15:44
  • 15 chars plus 2 for quotes and at least one for delimiter (although answer above uses 2 characters for the delimiter) = 18. 65536/18 = 3640. – Tom Nov 03 '17 at 18:59
  • I believe it's even possible to generate the macro var and then use it in a subsequent query in the same proc sql block if you want to golf a few lines. – user667489 Nov 03 '17 at 23:31
  • @user667489 in general yes, but since these are hitting different data sources (db vs SAS), it's good practice, IMO to separate these out. – Reeza Nov 04 '17 at 02:39
2

Please be sure to use

option sastrace=',,,ds' sastraceloc=saslog nostsuffix;

to receive information on how your code is translated by SAS/Aceess engine to DB statements.

In order to give SAS a hint to dynamicly build IN (1,2,3, ..) clause from your IN (SELECT .. query

  1. add MULTI_DATASRC_OPT=IN_CLAUSE to your libname DW ... statement and
  2. add dbmaster dataset option to the "master" table

like one of the following queries:

CREATE TABLE WORK.OUTPUT AS 
SELECT 
    "CLAIM" AS SOURCE,
    a.CLAIMID, 
    a.DXCODE
FROM 
    DW.CLAIMS_BAV (dbmaster=yes) AS a
WHERE 
    a.SITEID = '0001'
    AND a.CLAIMID IN (SELECT CLAIMID FROM WORK.INPUT)

or

CREATE TABLE WORK.OUTPUT AS 
SELECT 
    "CLAIM" AS SOURCE,
    a.CLAIMID, 
    a.DXCODE
FROM 
    DW.CLAIMS_BAV (dbmaster=yes) AS a
    inner join WORK.INPUT AS b
    on a.CLAIMID = b.CLAIMID
WHERE 
    a.SITEID = '0001'
vasja
  • 4,732
  • 13
  • 15
1

Using the In() without sub-querying is definitely faster, but other performance consideration to keep in mind is the network and compute server load/traffic at the time of running; assuming you are running on a client / server configuration.

If you plan to use the SQL select into macro variable solution; keep in mind the count of distinct values and the length of the string you are saving in the macro as there is a size limit.

momo1644
  • 1,769
  • 9
  • 25
1

You can also save the In() values in a table and just do a join.

PROC SQL;
/*CLAIM ID Table*/
CREATE TABLE WORK.OUTPUT1 AS 
SELECT 
    "CLAIM" AS SOURCE,
    a.CLAIMID, 
    a.DXCODE
FROM 
    DW.CLAIMS_BAV AS  a
WHERE 
    a.SITEID = '0001';
/*ID Lookup Table*/
CREATE TABLE WORK.OUTPUT2 AS 
SELECT
DISTINCT b.CLAIMID  FROM WORK.INPUT AS b
;
/*Inner Join Table / AKA lookup join*/
CREATE TABLE WORK.Final AS 
SELECT 
a.SOURCE,    a.CLAIMID,     a.DXCODE
FROM WORK.OUTPUT1 AS a INNER JOIN WORK.OUTPUT2 AS b
ON a.CLAIMID = b.CLAIMID
;
QUIT;
sorak
  • 2,607
  • 2
  • 16
  • 24
ofzy
  • 95
  • 11
  • As far as the structure and operation of this query goes, you've got a lot to explain, here. – sorak Mar 08 '18 at 08:22