
In Pig, what is an effective way to get a count? We can do a GROUP ALL, but that is executed with only 1 reducer. When the data size is very large, say n terabytes, can we somehow use multiple reducers?

  dataCount = FOREACH (GROUP data ALL) GENERATE
    'count' AS metric,
    COUNT(data) AS value;
Freya Ren
  • Possible duplicate of [PIG how to count a number of rows in alias](http://stackoverflow.com/questions/9900761/pig-how-to-count-a-number-of-rows-in-alias) – WattsInABox Jan 12 '16 at 22:35

2 Answers


Instead of using a GROUP ALL directly, you can divide the count into two steps. First, group by some field and count the number of rows in each group. Then, perform a GROUP ALL to sum those counts. This way, the row counting is done in parallel.

Note, however, that if the field you use in the first GROUP BY has no duplicates, the resulting counts will all be 1, so there won't be any difference. Use a field with many duplicates so the first step actually reduces the data.

See this example:

a;1
a;2
b;3
b;4
b;5

If we first group by the first field, which has duplicates, the final COUNT will deal with 2 rows instead of 5:

A = load 'data' using PigStorage(';');
B = group A by $0;
C = foreach B generate COUNT(A);
dump C;
(2)
(3)
D = group C all;
E = foreach D generate SUM(C.$0);
dump E;
(5)

However, if we group by the second one, which is unique, it will deal with 5 rows:

A = load 'data' using PigStorage(';');
B = group A by $1;
C = foreach B generate COUNT(A);
dump C;
(1)
(1)
(1)
(1)
(1)
D = group C all;
E = foreach D generate SUM(C.$0);
dump E;
(5)
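
Putting the two steps together in the shape of the original query, a sketch could look like this (the input path, schema, and PARALLEL level are assumptions you would adjust to your data):

data = load 'input' using PigStorage(';') as (bucket, val);
-- step 1: partial counts, computed in parallel across reducers
grouped = group data by bucket parallel 20;
partials = foreach grouped generate COUNT(data) as cnt;
-- step 2: tiny final aggregation over one row per group
dataCount = foreach (group partials all) generate 'count' as metric, SUM(partials.cnt) as value;
dump dataCount;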
Balduz

I just dug a bit more into this topic, and it seems you don't have to be afraid that a single reducer will have to process an enormous amount of data if you're using an up-to-date Pig version. The algebraic UDFs handle COUNT smartly: it is calculated on the mappers, so the reducer only has to deal with the aggregated data (one count per mapper). I think this was introduced in 0.9.1, but 0.14.0 definitely has it.

Algebraic Interface

An aggregate function is an eval function that takes a bag and returns a scalar value. One interesting and useful property of many aggregate functions is that they can be computed incrementally in a distributed fashion. We call these functions algebraic. COUNT is an example of an algebraic function because we can count the number of elements in a subset of the data and then sum the counts to produce a final output. In the Hadoop world, this means that the partial computations can be done by the map and combiner, and the final result can be computed by the reducer.
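
You can verify this yourself with EXPLAIN: for a plain GROUP ALL count, the printed map-reduce plan should contain a combine plan that computes the partial counts (the input path here is just a placeholder):

A = load 'input' using PigStorage(';');
B = foreach (group A all) generate COUNT(A);
explain B;    -- look for a "Combine Plan" containing the partial COUNT

Only one pre-aggregated count per map task then reaches the single reducer, which is why the GROUP ALL itself stays cheap.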

But my previous answer was definitely wrong:

In the grouping you can use the PARALLEL n keyword; this sets the number of reducers.

Increase the parallelism of a job by specifying the number of reduce tasks, n. The default value for n is 1 (one reduce task).
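
For reference, the syntax in question would be something like this (alias and field names are made up; as the comments below explain, it does not help with a GROUP ALL, because a single group always lands on one reducer):

B = group A by field parallel 20;    -- 20 reduce tasks for this GROUP BY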

kecso
  • `GROUP ALL` generates one group, and a group must be in only one reducer. Therefore `PARALLEL` will not help in this situation. – Balduz May 12 '15 at 08:18
  • And what about the default parallel? `SET default_parallel 20;` ? – kecso May 12 '15 at 13:17
  • Each group is executed on only ONE node; the level of parallelism, the default parallel, etc. do not matter. If you use `group all`, you have one huge group on a single node no matter the level of parallelism. – Balduz May 12 '15 at 13:19
  • Probably the default parallel doesn't work, for the same reason Balduz said. But here I found something: http://pig.apache.org/docs/r0.11.1/perf.html#parallel `pig.exec.reducers.bytes.per.reducer` - defines the number of input bytes per reducer; the default value is 1000*1000*1000 (1GB). – kecso May 12 '15 at 13:22
  • Again... each group is executed on one node and one node only. As you can see in the official documentation, a `group all` is performed on only ONE node no matter the parallel level, since Pig replaces the parallelism with 1: http://pig.apache.org/docs/r0.11.1/perf.html#GroupByConstParallelSetter – Balduz May 12 '15 at 13:27
  • Yes, it clearly mentions this for the `PARALLEL` keyword. Now I've gotten curious, so I'll try what `pig.exec.reducers.bytes.per.reducer` does in this case :) (probably you're right and it also won't help... we'll see) – kecso May 12 '15 at 13:36
  • No success, it does nothing in this case (tested). You're right, Balduz. – kecso May 12 '15 at 15:28