Produce single output from Pig Script using Filter

Question

I have a simple Pig script that uses only the FILTER command. It looks something like this:

--sample_script.pig
some_data = load './a_file' as (col1:chararray, col2:chararray);
contains_ = filter some_data by (col2 == '1') OR (col2 == '2');
store contains_ into './a_new_file';

When I run this script, it outputs a folder a_new_file with 3 files in it: part-m-00000, part-m-00001 and _SUCCESS. From what I can gather, the way I have written my script doesn't require a reduce phase. Is there a different way to write this so that the script outputs only one file?

Thanks.

Is it required that you not have a reduce phase? If you can force your data through a single reducer, that should do the job for you. Otherwise you can override the InputFormat to be non-splittable (which would mean you run a single mapper), or do some kind of post-processing to concatenate the files. – Sudarshan, May 16 '14 at 06:01
possible duplicate of [merge output files after reduce phase](http://stackoverflow.com/questions/5700068/merge-output-files-after-reduce-phase) – reo katoa, May 16 '14 at 13:41
@WinnieNicklaus I believe you are correct. I was hoping to modify my script, but running the extra command from [merge output files](http://stackoverflow.com/questions/5700068/merge-output-files-after-reduce-phase) works. – o-90, May 16 '14 at 16:55

score 0 · Accepted Answer · answered May 16 '14 at 06:22

You can set the number of reducers in the script itself:

--sample_script.pig
set default_parallel 1;
some_data = load './a_file' as (col1:chararray, col2:chararray);
contains_ = filter some_data by (col2 == '1') OR (col2 == '2');
store contains_ into './a_new_file';

OR

You can combine the small part files after the job finishes, e.g. with fs -getmerge.
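A minimal sketch of the merge approach, assuming it runs after the store statement in the same script. The destination './a_new_file_merged' is a hypothetical path on the local filesystem, not something from the question; adjust both paths to your setup.

```pig
-- Pig passes "fs" commands through to the Hadoop FsShell.
-- getmerge concatenates the part files under the source directory
-- into one file on the local filesystem.
-- './a_new_file_merged' is a hypothetical destination path.
fs -getmerge ./a_new_file ./a_new_file_merged
```

This leaves the original part files in place, so the job's output directory is unchanged.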

answered May 16 '14 at 06:22 by USB

I'm still getting multiple files. – o-90 May 16 '14 at 16:45
Did you try combining the small files? fs -getmerge – USB May 17 '14 at 04:44

score 0 · Answer 2 · edited Sep 01 '14 at 18:21

You can apply PARALLEL 1 to the filter statement alone, like this:

contains = filter some_data by (col2 == '1') OR (col2 == '2') PARALLEL 1;

This will create only one part file.
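Note that the Pig documentation describes PARALLEL as controlling only the number of reduce tasks, so on a map-only filter it may have no effect. A minimal sketch, assuming an extra sort is acceptable, forces a reduce phase with ORDER BY so that PARALLEL 1 applies. The alias sorted_ and the choice of col1 as the sort key are illustrative assumptions, not part of the original script.

```pig
-- sample_script.pig (sketch: add a reduce phase so PARALLEL takes effect)
some_data = load './a_file' as (col1:chararray, col2:chararray);
contains_ = filter some_data by (col2 == '1') OR (col2 == '2');
-- ORDER BY starts a reduce phase; PARALLEL 1 requests a single reducer.
-- 'sorted_' and the sort key 'col1' are hypothetical choices.
sorted_ = order contains_ by col1 parallel 1;
store sorted_ into './a_new_file';
```

The trade-off is that all filtered rows pass through one reducer, which can be slow on large outputs.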

edited Sep 01 '14 at 18:21 by Volker E.

answered Sep 01 '14 at 17:54 by harish kumar
