Assuming a few things, the fastest method for this is likely the hash iterator solution. Assumptions:
- Dataset B is small enough to fit (once) in memory.
- Either you don't care about the order of dataset B's rows in the result, or the rows are already in ascending or descending order by your key variables, or you can construct a key-order variable.
- Dataset B either has a key that uniquely identifies its rows, or your SAS release is new enough to support multidata:'yes' on the hash declaration (see the variant sketched just after this list).
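If B's key does not uniquely identify its rows, the only change is the multidata argument tag on the hash declaration. A minimal sketch, assuming the same dataset and variable names as the solution below (multidata requires a reasonably recent release, I believe 9.2+):
declare hash b_hash(dataset:'b', ordered:'a', multidata:'yes');  /* keep duplicate-key items */
b_hash.defineKey('data');
b_hash.defineData('data');
b_hash.defineDone();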
Given those assumptions, this works:
data want;
  if 0 then set b;                       /* add B's variables to the PDV without reading any rows */
  if _n_ = 1 then do;
    /* Load B into a hash once, ordered ascending ('a') by the key */
    declare hash b_hash(dataset:'b', ordered:'a');
    b_hash.defineKey('data');
    b_hash.defineData('data');
    b_hash.defineDone();
    /* Iterator used to walk every item of the hash for each row of A */
    declare hiter b_iter;
    b_iter = _new_ hiter('b_hash');
  end;
  set a;
  output;                                /* one record for the incoming A row */
  rc = b_iter.first();                   /* rc=0 while an item is retrieved successfully */
  do while (rc = 0);
    output;                              /* one record per B item alongside the current A row */
    rc = b_iter.next();
  end;
  /* rc ends up in WANT too; add a DROP statement if you don't want it */
run;
Depending on your use case, you may want to construct the defineData
call via the macro system and/or a dictionary.columns query in order to avoid hardcoding the column names.
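For example, one possible sketch using a dictionary.columns query (assuming B lives in the WORK library; the macro variable name b_vars is just illustrative):
proc sql noprint;
  /* Build a comma-separated list of quoted variable names from B */
  select cats("'", name, "'")
    into :b_vars separated by ','
    from dictionary.columns
    where libname = 'WORK' and memname = 'B';
quit;
/* ...then inside the data step above: */
b_hash.defineData(&b_vars.);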
This is a lot faster than point=; compare to the baseline:
data want_point;
  set a;
  output;                                /* one record for the incoming A row */
  do _n_ = 1 to nobs_b;                  /* nobs_b is set at compile time by nobs= below */
    set b point=_n_ nobs=nobs_b;         /* random-access read of row _n_ of B */
    output;                              /* one record per B row alongside the current A row */
  end;
run;
- With a large A (1e7 rows) and a small B (3 rows), the hash version takes ~10 seconds real time / 8 seconds CPU time (not much more than the raw write time), while the point baseline takes ~100 seconds real time / 12 seconds CPU time. (A sketch for generating comparable test data follows this list.)
- With a smaller A, point becomes somewhat more efficient but is still (slightly) outperformed by hash; the difference is probably not worth the extra coding effort. Both approach ~10 seconds to write out a 1e4-row A / 1e3-row B combination (which produces a file of similar size to the first test).
- With a small A and a large B (3 rows in A, 1e7 rows in B), hash takes a bit longer because of its expensive one-time setup cost: 67 seconds real time (28 seconds CPU) for the hash solution vs 65 seconds (17 seconds CPU) for point.
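For reference, dummy data along these lines is enough to reproduce the comparison (a sketch only, not the exact benchmark code; the variable name x in A is hypothetical, and the loop bounds match the first bullet):
data a;
  do x = 1 to 1e7;   /* A's columns don't matter to the join; only its row count does */
    output;
  end;
run;
data b;
  do data = 1 to 3;  /* 'data' matches the key/data variable used in the hash above */
    output;
  end;
run;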
As such, hash is recommended when you have a large dataset that you are combining repeatedly with a small dataset. If the two datasets are similarly sized, or the dataset that is re-read on every pass is the larger one, point is probably as good as you will get (given that the hash version is harder to write and maintain).