Say I have a cluster of 400 machines, and 2 datasets. some_dataset_1 has 100M records, some_dataset_2 has 1M. I then run:
ds1:=DISTRIBUTE(some_dataset_1,hash(field_a)); ds2:=DISTRIBUTE(some_dataset_2,hash(field_b));
Then, I run the join:
j1:=JOIN(ds1,ds2,LEFT.field_a=LEFT.field_b,LOOKUP,LOCAL);
Will the distribution of ds2 "mess up" the join, meaning parts of ds2 will be incorrectly scattered across the cluster leading to low match rate?
Or, will the LOOKUP keyword take precedence and the distributed ds2 will get copied in full to each node, thus rendering the distribution irrelevant, and allowing the join to find all the possible matches (as each node will have a full copy of ds2).
I know I can test this myself and come to my own conclusion, but I am looking for a definitive answer based on the way the language is written to make sure I understand and can use these options correctly.
For reference (from the Language Reference document v 7.0.0): LOOKUP: Specifies the rightrecset is a relatively small file of lookup records that can be fully copied to every node. LOCAL: Specifies the operation is performed on each supercomputer node independently, without requiring interaction with all other nodes to acquire data; the operation maintains the distribution of any previous DISTRIBUTE
It seems that with the LOCAL, the join completes more quickly. There does not seem to be a loss of matches on initial trials. I am working with others to run a more thorough test and will post the results here.