Why Proc Sql is faster than Proc Summary for sum-function?

Question

I have a column-based DB2-Blu table (DB2-Blu fixes indexing by itself) with 30.000 peer-groups and 50.000 values per peer group, total 1.5 billion rows.

I did a test to compare the run-time with two different processes: Proc Summary and Proc Sql like below:

proc summary data = table_blu  nodisplay nway missing chartype;
   /* Var1 and var2 are toghether peer-group  */
   class var1 var2 / groupinternal; 
   var values;

   output out = stattable
   sum=sum 
;
run;

Calculating only the sum-up through the peer-groups and it took around 14 min

Then I did the similar calculation with Proc SQL like below:

proc sql;
   create table stattable as
   select var1, var2, sum(values);
   from table_blu
   group by var1, var2;
quit;

This Proc SQL took only 1,26 min. It is definitely much less run_time comparing to the 14 min with Proc Summary process.

How can it be so big difference? Can it happen because the table is column-based? Maybe SAS is not optimal for column-based tables?

As someone [already explained to you](http://stackoverflow.com/questions/43519637/which-statistics-is-calculated-faster-in-sas-proc-summary), dragging the entire table into SAS over the network and calculating aggregates there is going to be inherently slower than doing aggregations where the data are, in the database, and sending only results over the wire. — mustaccio, Apr 21 '17 at 20:26
Possible duplicate of [Which statistics is calculated faster in SAS, proc summary?](http://stackoverflow.com/questions/43519637/which-statistics-is-calculated-faster-in-sas-proc-summary) — david25272, Apr 26 '17 at 01:20

score 1 · Answer 1 · answered Apr 24 '17 at 16:02

PROC SUMMARY will perform in-database processing if the database supports it and is supported by SAS, which will then be quite fast (and may well end up being identical to the PROC SQL, or may be faster).

However, this depends on both the DBMS and SAS. You're also adding an added layer here - the database acceleration software itself.

SAS has supported DB2 for a long time, but it's only supported DB2-BLU since 9.4M3 (one maintenance release before the current one, so about a year and a half). If you don't have 9.4M3 then you almost certainly don't have direct support, and thus probably aren't taking advantage of BLU and/or aren't taking advantage of in-database processing.

It's also possible that, since this is fairly new, that the in-database processing isn't as well implemented as you'd need; for example, I don't know if groupinternal is supported. Anything that SAS doesn't know how to translate to the DBMS, whether because it's something the DBMS doesn't support or because SAS hasn't implemented support for it yet, will cause SAS to download the whole table.

So my suggestion is:

Verify you have 9.4 TS1M3 or newer, and if not, get it. There is not typically a charge for upgrading to newer versions of SAS, so this should be doable, if some work.
If you do, try running a simpler summary (remove all options, just proc summary data=[yourdatabase.yourtable]; class [some class variable]; var [some variable]; output out=[something] sum=; run; or similar) and see if that is quicker. That'll at least tell you something.
Have your DBA or IT folks look at the database and/or network traffic and see if PROC SUMMARY is downloading the whole dataset; the DBA should be able to tell you what query SAS is running (if it's running a SQL summary or if it's just running SELECT and then is running the summary SAS-side) and/or IT should at least be able to tell you what's going over the network size-wise. Even if SAS is running the summary in DB2, maybe it's not running an optimized query.
Then, if you do have 9.4 TS1M3, open a SAS support track and give them the details from 2 and 3, plus your original question. They should be able to tell you whether it's expected that this query run on the database or if it's always going to download the table first, and then if so, what you can do about it (if anything).

score 0 · Answer 2 · answered Apr 21 '17 at 20:32

Assuming that the source table is within the work libname and not in a remote database, proc summary should perform faster than proc sql.

I tested below by creating a table with 190m observations and summarizing the data using both techniques:

332  data x;
333    set sashelp.class;
334    do i=1 to 10000000;
335      output;
336    end;
337  run;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.X has 190000000 observations and 6 variables.
NOTE: Compressing data set WORK.X decreased size by 15.12 percent.
      Compressed is 118495 pages; un-compressed would require 139604 pages.
NOTE: DATA statement used (Total process time):
      real time           1:23.30
      cpu time            1:04.79


338
339
340  proc summary data=x noprint nway missing chartype;
341     class sex age / groupinternal;
342     var height;
343     output out=stattable sum=sum;
344  run;

NOTE: Multiple concurrent threads will be used to summarize data.
NOTE: There were 190000000 observations read from the data set WORK.X.
NOTE: The data set WORK.STATTABLE has 11 observations and 5 variables.
NOTE: Compressing data set WORK.STATTABLE increased size by 100.00 percent.
      Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           1:04.79
      cpu time            1:11.65


345
346
347  proc sql noprint;
348    create table xx as
349    select sex, age, sum(height)
350    from x
351    group by 1,2
352    ;
NOTE: SAS threaded sort was used.
NOTE: Compressing data set WORK.XX increased size by 100.00 percent.
      Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: Table WORK.XX created, with 11 rows and 3 columns.

353  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           2:09.56
      cpu time            1:47.12

As you can see proc sql took almost twice the time.

Why Proc Sql is faster than Proc Summary for sum-function?

2 Answers2