
I followed this Stack Overflow question, which shows how to count rows in Pig.

The problem I found is that this approach becomes incredibly time-consuming when I apply a regex filter match and other operations before counting the rows of the filtered relation.

Here is my code:

all_data = LOAD '/logs/chat1.log' USING TextLoader() AS (line:chararray);
match_filter_1 = FILTER all_data BY (line MATCHES 'some regex');
inputGroup = GROUP match_filter_1 ALL;
totalLine = FOREACH inputGroup GENERATE COUNT(match_filter_1);
DUMP totalLine;

So, is there any way to get the result faster?

rubayet.R

1 Answer


Use the PARALLEL clause to increase the parallelism of a job:

PARALLEL sets the number of reduce tasks for the MapReduce jobs generated by Pig. The default value is 1 (one reduce task). PARALLEL only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block. If you don’t specify PARALLEL, you still get the same map parallelism but only one reduce task.

A = LOAD 'myfile' AS (t, u, v);
B = GROUP A BY t PARALLEL 18;
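
Applied to the script in the question, this might look like the sketch below (the reducer count of 10 is an arbitrary assumption; tune it to your cluster). Note that `GROUP ... ALL` collapses everything into a single group, so `PARALLEL` pays off most on grouping or joining by a key:

```
all_data = LOAD '/logs/chat1.log' USING TextLoader() AS (line:chararray);
match_filter_1 = FILTER all_data BY (line MATCHES 'some regex');
-- PARALLEL sets the number of reduce tasks for this operator
inputGroup = GROUP match_filter_1 ALL PARALLEL 10;
totalLine = FOREACH inputGroup GENERATE COUNT(match_filter_1);
DUMP totalLine;
```

You can also set a script-wide default with `SET default_parallel 10;` at the top of the script instead of adding `PARALLEL` to each operator.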

Hope this helps!

Bhavesh
  • Currently my Pig work is postponed, so I cannot try this right now and accept or reject. But once I resume the work I will definitely try this, and I hope to leave a positive review. – rubayet.R Nov 02 '16 at 14:32