1

I am wondering what is the ECL equivalent to "rank over partition" sql statement. For a project that I will be working on, I need to assign a rank according to date. I know how to do this in PySpark, but I do not want to export all the data to local.

I was thinking about using Iterate, but still not sure how to do that. Can't find relevant info in the documentation.

BDL
  • 21,052
  • 22
  • 49
  • 55
tianrenqin
  • 11
  • 1

1 Answers1

3

The SQL "partition by" translates to ECL as the GROUP() function, and ranking in ECL is easily done using PROJECT's COUNTER within each group.

I found an SQL example of this here and re-wrote that example in ECL:

rec := RECORD
  UNSIGNED1   id;
  STRING15    name; 
  STRING10    category; 
  UDECIMAL6_2 price;
END;    
ds := DATASET([{105,'Country Living','lifestyle',1.70},
               {108,'News Magazine','news',3.35},
               {115,'Decibel','music',6.50},
               {123,'Drum Magazine','music',6.50},
               {145,'Sunset','lifestyle',12.00},
               {155,'World','lifestyle',9.50},
               {158,'Keyboard','music',8.45}],rec);
sds := SORT(ds,category,-price);                             
gds := GROUP(sds,category); //creates a subgroup for each category
    //each subsequent operation on GROUPed data
    // executes separately and independently on each subgroup 
RankInPartition := PROJECT(gds,
                           TRANSFORM({gds,UNSIGNED1 price_rank},
                                     SELF.price_rank := COUNTER,
                                     SELF := LEFT));
OUTPUT(RankInPartition,{category,name,price,price_rank});                                                                        

You could also use ITERATE to do this, like this:

sds := SORT(ds,category,-price);                             
tds := TABLE(sds,{sds,price_rank := 0});                             
RankInPartition := ITERATE(tds,
                           TRANSFORM(RECORDOF(tds),
                                     SELF.price_rank :=
              IF(LEFT.category=RIGHT.category,LEFT.price_rank+1,1),
                                     SELF := RIGHT));

but this would be a global operation, and on huge datasets (we are working with "Big Data" here) it would likely be quite slow compared to the GROUP solution.

For the GROUP solution, each subsequent operation executes on each subgroup separately and independently. That means that if you have 40 nodes, at least 40 separate subgroups are executing in parallel for each grouped operation.

So in this example, the PROJECT ranks each subgroup separately and independently, adding the work field (price_rank) at the same time. But the ITERATE version requires a TABLE to add the new work field (price_rank) before ITERATE can work. And both of those operations are global.

HTH,

Richard

Richard Taylor
  • 493
  • 2
  • 6