
I have a simple Pig script that uses only the FILTER operator. It looks something like this:

--sample_script.pig
some_data = load './a_file' as (col1:chararray, col2:chararray);
contains_ = filter some_data by (col2 == '1') OR (col2 == '2');
store contains_ into './a_new_file';

When I run this script, it writes a folder a_new_file containing three files: part-m-00000, part-m-00001, and _SUCCESS. From what I can gather, the way I have written my script doesn't require a reduce phase. Is there a different way to write it so that it outputs only one file?

Thanks.

o-90
  • Is it required that you not have a reduce phase? I mean, if you can force your data through a single reducer, that should do the job for you. Otherwise, you can override the InputFormat to be non-splittable (which would mean you run a single mapper), or do some kind of post-processing to concatenate the files. – Sudarshan May 16 '14 at 06:01
  • possible duplicate of [merge output files after reduce phase](http://stackoverflow.com/questions/5700068/merge-output-files-after-reduce-phase) – reo katoa May 16 '14 at 13:41
  • @WinnieNicklaus I believe you are correct. I was hoping to modify my script, but running the extra command from [merge output files](http://stackoverflow.com/questions/5700068/merge-output-files-after-reduce-phase) works. – o-90 May 16 '14 at 16:55

2 Answers

You can set the number of reducers in the script itself:

--sample_script.pig
set default_parallel 1;
some_data = load './a_file' as (col1:chararray, col2:chararray);
contains_ = filter some_data by (col2 == '1') OR (col2 == '2');
store contains_ into './a_new_file';

Alternatively, you can combine the small part files after the job finishes, for example with hadoop fs -getmerge.

USB

You can apply PARALLEL 1 to the filter statement alone, like this:

contains_ = filter some_data by (col2 == '1') OR (col2 == '2') PARALLEL 1;

This should create only one part file.
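
One caveat worth keeping in mind: as far as I know, PARALLEL (like default_parallel) only controls the number of reducers, so on a map-only job such as a bare FILTER it may have no effect and you can still get one part-m file per mapper. Below is a minimal sketch of a variant that forces a reduce phase, assuming it is acceptable to sort the output and pay for the extra shuffle. The relation name ordered and the choice of col1 as the sort key are only illustrative and not part of the original script.

```pig
--sample_script.pig (sketch: force a single reducer)
some_data = load './a_file' as (col1:chararray, col2:chararray);
contains_ = filter some_data by (col2 == '1') OR (col2 == '2');
-- ORDER BY introduces a reduce phase, so PARALLEL 1 can take effect here.
-- 'ordered' and the sort key col1 are illustrative assumptions.
ordered = order contains_ by col1 parallel 1;
store ordered into './a_new_file';
```

With a single reducer the output folder would be expected to hold one part-r file plus _SUCCESS. If the sort is unwanted overhead, merging the part files after the job is probably the simpler route.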

Volker E.
harish kumar