
So here is the problem.

I have a dataset, and for each record, depending on a criterion, I want to load a different hash. I do not know the exact structure of each hash I will be loading until runtime, so I want to be able to execute the definedata method conditionally. Since I do not know the hash structure, I thought of passing the parameter to the definedata method via a variable, but it does not work. How can I accomplish this? Here is what I have so far:

/* Hashes have the same key field */
data hash1;
  key = '1';  a = 10; b = 20; output;
  key = '2';  a = 30; b = 40; output;
run;

/* Hash objects can have different data members and types */
data hash2;
  key = '1';  x = 'AAA'; y = 'BBB'; output;
  key = '2';  x = 'CCC'; y = 'DDD'; output;
run;

/* This is the dataset I want to process */
/* hid specifies which hash I should lookup */
/* key contains the key value to use for the lookup */
/* def is the hash data definition piece of the hash. 
   In practice I will use another hash to retrieve this definition
   But for simplicity we can assume that is part of the have dataset itself */

data have;
  hid = '1'; key = '2'; def = "'a', 'b'"; output;
  hid = '2'; key = '1'; def = "'x', 'y'"; output;
run;

/* This is what I want */

data want;
  set have;

  /* Though I don't know the structure of each hash, I can get a list of all hashes at the outset via some macro processing, so this statement is doable */
  if _N_ = 0 then set hash1 hash2;

  /* This part is OK. The hash declaration is able to accept a variable for the dataset name */

  hashname = "hash" || hid;
  declare hash hh(dataset: hashname);
  hh.definekey('key');

  /* The following line is the problematic piece */
  hh.definedata(def);

  hh.definedone();

  rc = hh.find();
  /* Do something with the values */

  /* Finally delete the object so that it can be redefined again on the next record */
  hh.delete();

run;

The error I get is: ERROR: Undeclared data symbol 'a', 'b' for hash object. I think the issue here is that the definedata method parses the variables one by one and ends up treating the entire string 'a', 'b' as one variable.
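
To illustrate the call semantics (a minimal sketch against the hh object above; the two calls show the contrasting cases):

/* each character argument to definedata names exactly one data variable */
hh.definedata('a', 'b');    /* two data variables: a and b              */
hh.definedata("'a', 'b'");  /* one variable literally named 'a', 'b'    */
                            /* -> ERROR: Undeclared data symbol         */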

If I define the hash data portion as a superset of all possible variables, the load complains when the dataset contains only a subset of those variables. And I cannot build the source datasets themselves to contain a superset of all variables (i.e. I cannot create all hashes to contain a, b, x and y and leave the extraneous elements missing).

So my question is: how can I accomplish what I am trying to do here? Is it possible to do macro %DO-like iteration using only DATA step constructs, to provide each variable one at a time? Or are there other ways to do it?

Constraints

  1. I cannot rely on macro processing, since I only know which hash I am going to use at runtime.
  2. I cannot load all the definitions ahead of time for memory reasons.

Any help will be appreciated.

SAS2Python
  • Does the *Do something with the values* part deal with the hash find retrieval unconditionally? Or is it something akin to `if hid="1" then do; something with a & b; end; else if hid="2" then do; something with x & y; end;` – Richard Jul 25 '19 at 10:58

2 Answers


You can store hash references in a separate hash; this is called a hash of hashes. Load the hash of hashes with references to the individual hashes, which are loaded only once at the start of the step.

Example:

data hash1;
  length key $1;
  input key a b;
  datalines;
1 10 20
2 30 40
3 50 60
4 70 80
;
run;

data hash2;
  length key $1;
  input key x :$3. y :$3.;
  datalines;
1 AAA BBB
2 CCC DDD
3 EEE FFF
4 GGG HHH
;
run;

data hashdataspec;
  length hid $1;
  input hid datavars :$15.;
  datalines;
1   a,b
2   x,y
;
run;

data have;
  do rowid = 1 to 100;
    p = floor (100*ranuni(123));
    q = 100 + ceil(100*ranuni(123));

    length r s $15;
    r = scan ("One of these will become the R value", ceil(8*ranuni(123)));
    s = scan ("How much wood would a woodchuck chuck if ...", ceil(9*ranuni(123)));

    length hid key $1;
    hid = substr('12',   ceil(2*ranuni(123)));
    key = substr('1234', ceil(4*ranuni(123)));

    output;
  end; 
run;

data want;
  sentinel0 = ' ';
  if 0 then set hash1-hash2 hashdataspec; * prep pdv for hash host variables;
  sentinel1 = ' ';

  * prep hashes, one time only;
  if _n_ = 1 then do;
    * load hash data specifiers;
    declare hash hds(dataset:'hashdataspec');
    hds.defineKey('hid');
    hds.defineData('hid', 'datavars');
    hds.defineDone();

    * prep hash of hashes;
    declare hash h;      /* dynamic hash that will be added to hoh */
    declare hash hoh();  /* hash of hashes */
    hoh.defineKey ('hid');
    hoh.defineData ('h');
    hoh.defineDone();

    * loop over hashdataspec, loading dynamically created hashes;
    declare hiter hi('hds');
    do while(hi.next() = 0);
      h = _new_ hash(dataset:cats('hash',hid));    * create dynamic hash;
      h.defineKey('key');
      do _n_ = 1 to countw(datavars);
        h.defineData(scan(datavars,_n_,','));      * define data vars, one at a time;
      end;
      h.defineDone();
      hoh.add();  * add the dynamic hash to the hash of hashes;
    end;
  end;

  * clear hash host variables;
  call missing (of sentinel0--sentinel1);

  set have;

  * lookup which hash (hid) to use;
  * this selects the appropriate dynamic hash from hoh and updates hash variable h;
  hoh.find();

  * lookup data for key in that hid's hash;
  h.find();

  drop datavars;
run;
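
A defensive variant worth noting (a sketch, not part of the answer as posted): both FIND calls return a code, so an unknown hid or key can be handled explicitly instead of assuming every lookup succeeds:

  * check return codes so an unknown hid or key is handled gracefully;
  if hoh.find() = 0 then rc = h.find();
  else rc = -1;  * unknown hid: host variables keep the cleared values;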
Richard
  • Unfortunately this pre-loads all hashes during startup, which I want to avoid. The solution I want should have only one hash in memory at a given time - though that can lead to runtime performance issues. +1 for the hash of hashes approach, which makes it possible to use a single hash reference in the lookup code – SAS2Python Jul 25 '19 at 18:38
  • What benefit do you perceive in loading the hash up from disk for each row in `have`? If you understand the runtime performance problem, why continue down that path? What is the cardinality of the hash tables (i.e. how many keys in total across all hashes)? HOH lets you use a single hash variable for multiple hash references. – Richard Jul 25 '19 at 19:26
  • I am not at the liberty of giving you the bigger picture, but this is not a batch process. Records periodically arrive from standard input and the record has to be processed and sent to standard output at periodic intervals. So it is OK to spend additional few milliseconds performing the hash load during runtime. – SAS2Python Jul 25 '19 at 19:34

Your program can be made to work but I think performance will be poor.

Notice I changed the value of DEF so it would be easier to SCAN.

data have;
   hid = '1'; key = '2'; def = "a b"; output;
   hid = '2'; key = '1'; def = "x y"; output;
   run;

/* This is what I want */

data want;
   if _N_ = 0 then set hash1 hash2;
   call missing(of _all_);
   set have;
   hashname = "hash" || hid;
   declare hash hh(dataset: hashname);
   hh.definekey('key');
   /* The problematic piece, replaced by a SCAN loop */
   length v $32;
   do i = 1 by 1;
      v = scan(def,i,' ');
      putlog v= i=;
      if missing(v) then leave;
      *hh.definedata(def);
      hh.definedata(v);
      end;
   hh.definedone();
   *hh.output(dataset: cats('X',hashname));

   rc = hh.find();
   /* Do something with the values */

   /* Finally delete the object so that it can be redefined again on the next record */
   hh.delete();
   run;
data _null_
  • Definitely does the trick. Thank you very much. After all these years of using SAS, I wasn't aware of the fact that you could invoke the `definedata` method multiple times for the same hash. That was the missing piece for me. – SAS2Python Jul 24 '19 at 23:03
  • What makes you think this loop will make it slow? I don't think it is a lot of processing, even if there are, say, 1000 variables in the hash. – SAS2Python Jul 24 '19 at 23:04
  • Several million records, usually read from standard input - kind of streaming data – SAS2Python Jul 24 '19 at 23:18
  • @PythonRSAS you probably don't want to redefine the hash for every single observation then? Try to group the observations together so that you can load each 'type' of hash once only and you'll see much greater performance (see the sketch after this thread). – Robert Penridge Jul 24 '19 at 23:27
  • That's a possibility, yes – SAS2Python Jul 24 '19 at 23:36
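
A minimal sketch of the grouping idea (assuming `have` can be sorted by hid, and reusing the space-delimited DEF from the answer above): each hash is built once per hid group instead of once per row, while only one hash stays in memory at a time:

proc sort data=have;
   by hid;
   run;

data want;
   if _N_ = 0 then set hash1 hash2;  /* host variables for every hash */
   call missing(of _all_);           /* clear host variables each row */
   set have;
   by hid;
   length v $32;
   declare hash hh;
   if first.hid then do;
      /* build the group's hash once, on its first row */
      hh = _new_ hash(dataset: cats('hash', hid));
      hh.definekey('key');
      do i = 1 by 1;
         v = scan(def, i, ' ');
         if missing(v) then leave;
         hh.definedata(v);
         end;
      hh.definedone();
      end;
   rc = hh.find();
   /* Do something with the values */
   if last.hid then hh.delete();  /* free it when the group ends */
   run;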