
Assume I have 2 data sets A and B:

Data A;
input data $;
datalines;
A1
A2
A3
;
run;

Data B;
input data $;
datalines;
B1
B2
B3
;
run;

I want to generate a data set with the following order:

A1
B1
B2
B3
A2 
B1
B2
B3
A3
B1
B2
B3

How can I do that from data sets A and B with a data step, without POINT=?

I have tried this method:

DATA WRONG_ANSWER;
SET A;
OUTPUT;
DO i = 1 to 3;
SET B; /* sequential read: reaches end-of-file after the first pass through B */
OUTPUT;
END;
RUN;

The result is:

A1
B1
B2
B3
A2

It looks like the end-of-file indicator on B terminates the data step.

I also tried another approach with POINT=, and it gives the correct result. However, this approach is very slow because of the massive I/O time spent accessing particular observations of B:

DATA SLOW_ANSWER;
SET A;
OUTPUT;
DO i = 1 to 3;
SET B POINT=i; /* this is the only difference from above */
OUTPUT;
END;
RUN;
Gary Liao
  • Good question (now). I'm curious if there's a set-based method. I'll post a not-set-based method, but would love to see an alternative using set (and not point). For sure, random access (`point`) is the method I'd naturally use in this case. It might be helpful to know what the relative sizes of the datasets are in your real case. – Joe Aug 21 '15 at 14:06
  • There are about 2 million obs in both the A and B data sets. – Gary Liao Aug 21 '15 at 14:29
  • Isn't this going to give ~4 trillion rows? (2 million * 2 million.) No wonder it's slow =/. Whatever you are trying to do, turning two 2-million-row tables into a 4-trillion-row table doesn't sound like a good approach. – Robert Penridge Aug 21 '15 at 16:05

1 Answer


Assuming a few things, the fastest method for this is likely the hash iterator solution. Assumptions:

  • Dataset B is small enough to fit (once) in memory.
  • Either you don't care about the order of dataset B rows in the resulting dataset, or you have an ascending or descending order by your key variables, or you can construct a key order variable.
  • Dataset B can have a key defined such that it contains unique rows, or you can use 'multidata:yes' (if your SAS version is new enough to support it; see the variant after the code below).

Given those assumptions, this works:

data want;
  if 0 then set b;   /* never executes, but defines B's variables in the PDV */
  if _n_=1 then do;  /* load B into memory once, on the first iteration */
    declare hash b_hash(dataset:'b', ordered:'a');
    b_hash.defineKey('data');
    b_hash.defineData('data');
    b_hash.defineDone();
    declare hiter b_iter;
    b_iter = _new_ hiter('b_hash');
  end;
  set a;
  output;               /* write the A row */
  rc = b_iter.first();  /* then walk all of B from the in-memory hash */
  do while (rc=0);
    output;
    rc = b_iter.next();
  end;
run;
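
If B's key is not unique (the third assumption above), the same setup can be declared with the MULTIDATA option so that duplicate key values are all retained. A minimal sketch of that variant, assuming SAS 9.2 or later (only the declaration block changes):

if _n_=1 then do;
  /* MULTIDATA:'yes' keeps every row of B even when key values repeat
     (assumes SAS 9.2+; everything else is as in the step above) */
  declare hash b_hash(dataset:'b', ordered:'a', multidata:'yes');
  b_hash.defineKey('data');
  b_hash.defineData('data');
  b_hash.defineDone();
  declare hiter b_iter('b_hash');
end;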

Depending on your use case, you may want to construct the defineData call via the macro system and/or a dictionary.columns query in order to avoid hardcoding the column names.
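
As a rough sketch of that idea (untested; it assumes both datasets live in the WORK library), DICTIONARY.COLUMNS can feed a macro variable that is then passed to defineData:

proc sql noprint;
  /* build a comma-separated list of quoted column names from B */
  select cats("'", name, "'")
    into :b_cols separated by ','
    from dictionary.columns
    where libname='WORK' and memname='B';
quit;

/* ...then in the hash setup, instead of hardcoding 'data': */
b_hash.defineData(&b_cols.);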

This is a lot faster than POINT=; compare it to the baseline:

data want_point;
  set a;
  output;
  do _n_ = 1 to nobs_b;
    set b point=_n_ nobs=nobs_b; /* NOBS= is assigned at compile time, so it is available before this SET executes */
    output;
  end;
run;
  • With a large A (1e7 rows) and a small B (3 rows), the hash version takes ~10 seconds real time / 8 seconds CPU time (not substantially more than the total write time), while the POINT= baseline takes 100 seconds real time / 12 seconds CPU time.
  • With a smaller A, POINT= becomes somewhat more efficient, but is still outperformed by hash (though only slightly, and likely not worth the difference in coding difficulty). Both approach 10 seconds of write time for a 1e4-row A / 1e3-row B combination (which yields a file of similar size to the first test).
  • With a small A and a large B (3 rows in A, 1e7 rows in B), hash takes a bit longer because it has an expensive one-time setup cost: 67 seconds real time (28 seconds CPU) for the hash solution vs. 65 seconds (17 seconds CPU) for POINT=.
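
For reference, a minimal sketch of test data along the lines of the first benchmark above (the names and sizes are illustrative, not the exact data used for the timings):

data a;
  length data $12;
  do i = 1 to 1e7;          /* large A: 1e7 rows */
    data = cats('A', i);
    output;
  end;
  drop i;
run;

data b;
  length data $12;
  do i = 1 to 3;            /* small B: 3 rows */
    data = cats('B', i);
    output;
  end;
  drop i;
run;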

As such, hash is recommended if you have a large dataset that you are combining repeatedly with a small dataset. If the two datasets are similarly sized, or the repeatedly-read dataset is the larger one, POINT= is probably as good as you will get (given that the hash code is harder to maintain).

Joe
  • There is still a POINT= in the second approach you provided; how can this approach be faster than my last approach? Was there something important that I didn't notice? – Gary Liao Aug 24 '15 at 01:53
  • The second approach is the POINT= approach - the baseline (i.e., what you were comparing from). I was testing it against hash and showing that it is sometimes perfectly fine performance-wise and sometimes not. – Joe Aug 24 '15 at 01:54