
Suppose I have a directed graph represented in a dataset named links, which has two variables: from_id and to_id. I want to use SAS Data Step to do two things: (1) count the number of nodes, and (2) count the number of edges.

Suppose the links dataset is as shown below.

from_id    to_id
----------------
   1         2
   2         3
   3         1
   3         2

In this example, there are 3 nodes and 4 edges. (We can assume there are no duplicate edges in links.) The nodes are 1, 2, and 3. The edges are 1->2, 2->3, 3->1, and 3->2.
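For anyone who wants to reproduce the example, the sample data could be created with a step like the one below. This is only a sketch: it assumes both variables are numeric and that the dataset lives in WORK, neither of which is stated explicitly above.

```sas
/* Sample data from the example: 3 nodes, 4 directed edges.   */
/* Assumption: from_id and to_id are numeric variables.       */
data links;
  input from_id to_id;
  datalines;
1 2
2 3
3 1
3 2
;
run;
```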

Below is a SAS macro that uses a Data Step in conjunction with proc sql to count the nodes and edges. It gives the correct results, but I would like to use only the Data Step, so that the counting may (potentially) be done faster.

/* display number of nodes and edges for graph */
%macro graph_info(links);
data nodes;
    set &links;
    node_id = from_id;
    output;
    node_id = to_id;
    output;
    keep node_id;
run;

proc sql noprint;
    select count(distinct node_id) into :numNodes from nodes;
quit;
proc datasets lib=work nolist;
    delete nodes;
quit;

proc sql noprint;
    select count(*) into :numEdges from &links;
quit;

%put Nodes: &numNodes;
%put Edges: &numEdges;
%mend;
synaptik

1 Answer


If you have enough memory, you may be able to do this with a hash object.

Be warned: this code is untested, as I don't have a SAS installation to hand. However, the basic idea should work. You iterate through the data step, adding each node to the hash object, and on the last observation you set macro variables to the size of the hash object and the number of observations read.

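data _null_;
  set links end=lastrec;
  format node_id 8.;
  /* Create the hash object once, on the first observation */
  if _N_ eq 1 then do;
    declare hash h();
    h.defineKey("node_id");
    h.defineDone();
  end;
  /* Add each endpoint only if it is not already in the hash */
  node_id = from_id;
  rc=h.find();
  if rc ne 0 then h.add();
  node_id = to_id;
  rc=h.find();
  if rc ne 0 then h.add();
  /* On the last observation, store the counts in macro variables */
  if lastrec then do;
    call symput('numNodes', put(h.num_items, 8. -L));
    call symput('numEdges', put(_N_, 8. -L));
  end;
run;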
Simon Nickerson
  • The only two things I'd add to this are 1: don't use NOBS, as that could be wrong if the dataset was modified - use _N_ which is clearly safe; and 2: don't you need to check to see if the node exists in the hash before adding it? Otherwise you add duplicate nodes, I think? – Joe Nov 09 '12 at 20:30
  • I think duplicates are removed when you call add(), so that should be fine. But good point about _N_. I'll edit. – Simon Nickerson Nov 09 '12 at 20:33
  • `363 rc = h.defineKey(node_id);` and then `ERROR: Uninitialized object at line 363 column 9.` – synaptik Nov 09 '12 at 20:34
  • No. add() doesn't check automatically. I made an edit with your code after testing it - it should work now (if it's approved). You have a few minor syntax issues aside from that, that I corrected. – Joe Nov 09 '12 at 20:36
  • @Joe Thanks - I'll take your word for it. – Simon Nickerson Nov 09 '12 at 20:42
  • @synaptik: sorry, as I said, I don't have a version of SAS to test against. Does the edited version work? – Simon Nickerson Nov 09 '12 at 20:44
  • The edited version ran, and appeared to give the expected results in my opinion on the test dataset (3 and 4). Whether it fits in memory, no idea... – Joe Nov 09 '12 at 20:50
  • OK, thanks. It works now. I will try this, hopefully it's faster, because my proc sql solution takes a very long time! :) Thanks guys. I'll let you know if it's faster. – synaptik Nov 09 '12 at 20:57
  • GREAT! Took fewer than 10 minutes, whereas previously I was waiting over 45 minutes for my proc sql-based method. Thanks again. – synaptik Nov 09 '12 at 21:12
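
On the add() question raised in the comments: as far as I know, add() does not silently drop a duplicate key. It fails, and it only writes an error to the log when the return code is not captured. A possible variant, then, is to skip find() and assign the return code of add() to a variable instead. The sketch below assumes the links dataset from the question with numeric from_id and to_id, and that all distinct node ids fit in memory. It also swaps in call symputx, which strips blanks from the value. I have not timed it against the find()-then-add() version.

```sas
/* Sketch only: count nodes and edges in a single pass over links. */
data _null_;
  set links end=lastrec;
  length node_id 8;
  /* Create the hash object once, on the first observation */
  if _N_ = 1 then do;
    declare hash h();
    h.defineKey('node_id');
    h.defineDone();
  end;
  /* Capturing rc suppresses the duplicate-key error in the log;  */
  /* rc is nonzero when the key is already stored, and the hash   */
  /* is left unchanged in that case.                              */
  node_id = from_id;
  rc = h.add();
  node_id = to_id;
  rc = h.add();
  /* On the last observation, publish the counts */
  if lastrec then do;
    call symputx('numNodes', h.num_items);
    call symputx('numEdges', _N_);
  end;
run;

%put Nodes: &numNodes;
%put Edges: &numEdges;
```

Note that if links is empty, the step stops before the lastrec block runs, so neither macro variable is set.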