Instead of using directly a GROUP ALL
, you could divide it into two steps. First, group by some field and count the number of rows. And then, perform a GROUP ALL
to sum all of these counts. This way, you would be able to count the number of rows in parallel.
Note, however, that if the field you use in the first GROUP BY
does not have duplicates, the resulting counts will all be of 1 so there wont be any difference. Try using a field that has many duplicates to improve its performance.
See this example:
a;1
a;2
b;3
b;4
b;5
If we first group by the first field, which has duplicates, the final COUNT
will deal with 2 rows instead of 5:
A = load 'data' using PigStorage(';');
B = group A by $0;
C = foreach B generate COUNT(A);
dump C;
(2)
(3)
D = group C all;
E = foreach D generate SUM(C.$0);
dump E;
(5)
However, if we group by the second one, which is unique, it will deal with 5 rows:
A = load 'data' using PigStorage(';');
B = group A by $1;
C = foreach B generate COUNT(A);
dump C;
(1)
(1)
(1)
(1)
(1)
D = group C all;
E = foreach D generate SUM(C.$0);
dump E;
(5)