How do I sort data into a Zebra table in Pig?

Question

I am trying to store unsorted data from a CSV into a Zebra table in Pig using TableStorer. Do I need to do an ORDER BY before the store to make sure it's sorted and/or do I need to pass some information to the TableStorer to indicate the sort field?

score 1 · Accepted Answer · edited May 21 '14 at 18:37

As per the Zebra and Pig documentation, in the Sorting Data section:

Pig allows you to sort data by ascending or descending order (for more information, see the Pig reference manual). Currently, Zebra supports tables that are sorted in ascending order. Zebra does not support tables that are sorted in descending order; if Zebra encounters a table to be stored that is sorted in descending order, Zebra will issue a warning and store the table as an unsorted table.

So if you want to save the data sorted in descending order, it is a good idea to sort the tuples in the Pig script first and then store them in the Zebra table. At any point, the data in Pig is a collection of values, so it can always be sorted with a simple ORDER BY before it is stored to its destination.

Example:

A = LOAD 'data' AS (a1:int,a2:int,a3:int);

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

In this example, relation A is sorted by the third field, a3, in descending order. Note that the order of the three tuples ending in 3 can vary.

X = ORDER A BY a3 DESC;

DUMP X;
(7,2,5)
(8,3,4)
(1,2,3)
(4,3,3)
(8,4,3)
(4,2,1)

STORE X INTO 'output' USING org.apache.hadoop.zebra.pig.TableStorer('');
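
Since the documentation quoted above says Zebra only keeps tables sorted in ascending order, an ascending variant of the same script may be closer to what the question asks for. The following is only a sketch under assumptions: the input file name 'data.csv', the output name 'sorted_output', and the choice of a3 as the sort field are hypothetical, and the Zebra jar is assumed to be available to Pig already.

```pig
-- Load the unsorted CSV; PigStorage(',') splits each line on commas.
-- 'data.csv' is a hypothetical file name.
A = LOAD 'data.csv' USING PigStorage(',') AS (a1:int, a2:int, a3:int);

-- Sort ascending (the default direction for ORDER BY), the only order
-- that the quoted documentation says Zebra keeps as a sorted table.
S = ORDER A BY a3 ASC;

-- Store the sorted relation; the empty string is the same storage hint
-- used in the example above. 'sorted_output' is a hypothetical name.
STORE S INTO 'sorted_output' USING org.apache.hadoop.zebra.pig.TableStorer('');
```

With the DESC version above, the quoted documentation suggests Zebra would issue a warning and store the table as unsorted.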

edited May 21 '14 at 18:37

reo katoa

answered May 21 '14 at 17:56

Krati Jain

That works fine for writing it, but when I try to read back more than one file using LOAD '{sample_201404*}' USING org.apache.hadoop.zebra.pig.TableLoader('', 'sorted') I get an error: Unable to create input splits for: hdfs://10.146.186.41:9000/user/hadoop/{sample_201404*} – bridiver May 21 '14 at 19:37
Caused by: java.lang.NullPointerException at org.apache.hadoop.zebra.io.KeyDistribution.add(KeyDistribution.java:50) at org.apache.hadoop.zebra.io.KeyDistribution.resize(KeyDistribution.java:204) at org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSortedSplits(TableInputFormat.java:654) at org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:1013) at org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:961) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274) – bridiver May 21 '14 at 19:41
What is the version of Hadoop, Pig and Zebra that you are using? – Krati Jain May 22 '14 at 05:59
@bridiver It looks like the issue you are facing is due to version incompatibilities, as seen in this question: http://stackoverflow.com/questions/21632476/pig-0-7-0-error-2118-unable-to-create-input-splits-on-hadoop-1-2-1 – Krati Jain May 22 '14 at 06:02
I'm using 0.12.1 and Zebra is part of the contrib code base in the release. It turned out to be two bugs in the Zebra code for handling sorted table unions. The first one occurred if the size of the input file was smaller than the block size and the other was the use of a deprecated method. I have patched them and will submit the changes. – bridiver May 22 '14 at 12:37
