Problem: I have a data set A {field1, field2, field3...}. I would like to first group A by, say, field1, and then within each of the resulting groups run a bunch of subqueries: for example, count the number of rows where field2 == true, or count the number of distinct values of field3 among the rows where field4 == "some_value" and field5 == false, etc.
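To make the shape of the data concrete, here is a toy stand-in for A in the DataFrame API (the values and the local SparkSession setup are made up purely for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("example").getOrCreate()
import spark.implicits._

// Toy version of A; field names are from the problem statement, values are invented.
val A = Seq(
  ("g1", true,  "x", "some_value", false),
  ("g1", false, "y", "some_value", false),
  ("g2", true,  "x", "other",      true)
).toDF("field1", "field2", "field3", "field4", "field5")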
Some alternatives I can think of: I could write a custom user-defined aggregate function that takes a predicate for the filtering condition, but that way I would have to create an instance of it for every query condition. I have also looked at the countDistinct function, which can achieve some of these operations, but I can't figure out how to use it to implement the filter-distinct-count semantics (see the sketch below).
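For reference, this sketch against the toy A above is roughly as far as I got with the built-in aggregates; it shows what countDistinct gives me, and the filter-then-aggregate workaround that needs a separate pass per condition:

import org.apache.spark.sql.functions._

// A plain distinct count per group works fine...
A.groupBy("field1")
  .agg(countDistinct("field3").as("distinct_field3"))
  .show()

// ...and a filtered row count (fa below) can be done by filtering first,
// but then every condition needs its own pass over A plus a join to recombine.
val fa = A.filter(col("field2") === true)
  .groupBy("field1")
  .count()
fa.show()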
In Pig, I can do:
FOREACH (GROUP A BY field1) {
    field_a = FILTER A BY field2 == TRUE;
    field_b = FILTER A BY field4 == 'some_value' AND field5 == FALSE;
    field_c = DISTINCT field_b.field3;
    GENERATE FLATTEN(group),
             COUNT(field_a) AS fa,
             COUNT(field_b) AS fb,
             COUNT(field_c) AS fc;
}
Is there a way to do this in Spark SQL?