How to normalize a tuple of maps in Apache Pig?

Question

I have the following relation in a pig script:

my_relation: {entityId: chararray,attributeName: chararray,bytearray}

(++JIYMIS2D,timeseries,([value#50.0,timestamp#1388675231000]))
(++JRGOCZQD,timeseries,([value#50.0,timestamp#1388592317000],[value#25.0,timestamp#1388682237000]))
(++GCYI1OO4,timeseries,())
(++JYY0LOTU,timeseries,())

There can be any number of value/timestamp pairs in the bytearray column (even zero).

I would like to transform this relation into this (one row for each entityId, attributeName, value, timestamp quartet):

++JIYMIS2D,timeseries,50.0,1388675231000
++JRGOCZQD,timeseries,50.0,1388592317000
++JRGOCZQD,timeseries,25.0,1388682237000
++GCYI1OO4,timeseries,,
++JYY0LOTU,timeseries,,

Alternatively, this would be fine too, since I am not interested in the rows that have no value/timestamp pairs:

++JIYMIS2D,timeseries,50.0,1388675231000
++JRGOCZQD,timeseries,50.0,1388592317000
++JRGOCZQD,timeseries,25.0,1388682237000

Any ideas? Basically I want to normalize the tuple of maps in the bytearray column so that the schema is like this:

my_relation: {entityId: chararray,
              attributeName: chararray, 
              value: float, 
              timestamp: int}

I am a Pig beginner, so sorry if this is obvious! Do I need a UDF to do this?

This question is similar but has no answers so far: How do I split in Pig a tuple of many maps into different rows

I am running Apache Pig version 0.12.0-cdh5.1.2

EDIT - adding details of what I've done so far.

Here's a pig script snippet, with output below:

-- StateVectorFileStorage is a LoadStoreFunc and AttributeData is a UDF, both java. 
ts_to_average = LOAD 'StateVector' USING StateVectorFileStorage();
ts_to_average = LIMIT ts_to_average 10;
ts_to_average = FOREACH ts_to_average GENERATE entityId, FLATTEN(AttributeData(*));
a = FOREACH ts_to_average GENERATE entityId, $1 as attributeName:chararray, $2#'value';
b = foreach a generate entityId, attributeName, FLATTEN($2);

c_no_flatten = foreach b generate
  $0 as entityId,
  $1 as attributeName,
  TOBAG($2 ..);

c = foreach b generate
  $0 as entityId,
  $1 as attributeName,
  FLATTEN(TOBAG($2 ..));

d = foreach c generate
  entityId,
  attributeName,
  (float)$2#'value' as value,
  (int)$2#'timestamp' as timestamp;

dump a;
describe a;
dump b;
describe b;
dump c_no_flatten;
describe c_no_flatten;
dump c;
describe c;
dump d;
describe d;

Output follows. Notice how in the relation 'c', the second value/timestamp pair [value#52.0,timestamp#1388683516000] is lost.

(++JIYMIS2D,RechargeTimeSeries,([value#50.0,timestamp#1388675231000],[value#52.0,timestamp#1388683516000]))
(++JRGOCZQD,RechargeTimeSeries,([value#50.0,timestamp#1388592317000]))
(++GCYI1OO4,RechargeTimeSeries,())
a: {entityId: chararray,attributeName: chararray,bytearray}

(++JIYMIS2D,RechargeTimeSeries,[value#50.0,timestamp#1388675231000],[value#52.0,timestamp#1388683516000])
(++JRGOCZQD,RechargeTimeSeries,[value#50.0,timestamp#1388592317000])
(++GCYI1OO4,RechargeTimeSeries)
b: {entityId: chararray,attributeName: chararray,bytearray}

(++JIYMIS2D,RechargeTimeSeries,{([value#50.0,timestamp#1388675231000])})
(++JRGOCZQD,RechargeTimeSeries,{([value#50.0,timestamp#1388592317000])})
(++GCYI1OO4,RechargeTimeSeries,{()})
c_no_flatten: {entityId: chararray,attributeName: chararray,{(bytearray)}}

(++JIYMIS2D,RechargeTimeSeries,[value#50.0,timestamp#1388675231000])
(++JRGOCZQD,RechargeTimeSeries,[value#50.0,timestamp#1388592317000])
(++GCYI1OO4,RechargeTimeSeries,)
c: {entityId: chararray,attributeName: chararray,bytearray}

(++JIYMIS2D,RechargeTimeSeries,50.0,1388675231000)
(++JRGOCZQD,RechargeTimeSeries,50.0,1388592317000)
(++GCYI1OO4,RechargeTimeSeries,,)
d: {entityId: chararray,attributeName: chararray,value: float,timestamp: int}

score 0 · Accepted Answer · edited May 23 '17 at 11:49


This should do the trick. First, flatten the tuple of maps to get rid of the encapsulating tuple:

b = foreach a generate entityId, attributeName, FLATTEN($2);

Now we can convert everything but the first two fields into a bag. The bag can be flattened (see http://pig.apache.org/docs/r0.12.0/basic.html#flatten) to get rows for each value/timestamp pair:

c = foreach b generate
  $0 as entityId,
  $1 as attributeName,
  FLATTEN(TOBAG($2 ..));

Lastly, get the values you need out of the map:

d = foreach c generate
  entityId,
  attributeName,
  (float)$2#'value' as value,
  (long)$2#'timestamp' as timestamp;  -- long, not int: epoch milliseconds exceed the 32-bit int range
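
A note on the timestamp type: the sample timestamps are epoch milliseconds, and Pig's int is a signed 32-bit integer while its long is a signed 64-bit integer. The short Python check below is only an illustration of that arithmetic (the helper names are made up for this sketch); it shows that the sample values need a long. The dump in the question happens to print the full value under an int schema, so what the cast actually does at runtime for this loader is unclear, but the declared type cannot represent these values.

```python
# Range limits for Pig's int (signed 32-bit) and long (signed 64-bit).
INT_MIN, INT_MAX = -2**31, 2**31 - 1
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1


def fits_in_int(n):
    """True if n is representable as a signed 32-bit integer."""
    return INT_MIN <= n <= INT_MAX


def fits_in_long(n):
    """True if n is representable as a signed 64-bit integer."""
    return LONG_MIN <= n <= LONG_MAX


# Timestamps taken from the sample data in the question.
timestamps = [1388675231000, 1388592317000, 1388682237000]

for ts in timestamps:
    print("%d int=%s long=%s" % (ts, fits_in_int(ts), fits_in_long(ts)))
```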

Update: Some other options to make a bag of maps out of the tuple of maps:

DataFu's TransposeTupleToBag: http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/util/TransposeTupleToBag.html
The foo() Python UDF in this answer: Pig - how to iterate on a bag of maps
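
As a rough illustration of the UDF route, here is a minimal sketch of a Python (Jython) UDF that turns a tuple of maps into a bag of one-map tuples. It is not the foo() function from the linked answer; the file name, function name, and schema string are all hypothetical. It assumes Pig hands the tuple to Jython as a Python tuple of dicts and accepts a list of tuples as a bag, and that Pig's Jython engine supplies the outputSchema decorator. The fallback definition is there only so the file also runs in plain Python.

```python
# Hypothetical file name: tuple_to_bag.py
#
# Untested usage sketch on the Pig side (alias and names are assumptions):
#   REGISTER 'tuple_to_bag.py' USING jython AS udfs;
#   c = FOREACH a GENERATE entityId, attributeName,
#           FLATTEN(udfs.tuple_to_bag($2)) AS m;

try:
    outputSchema  # expected to be provided by Pig's Jython engine
except NameError:
    # Plain-Python fallback: record the schema string and return the function.
    def outputSchema(schema):
        def decorator(func):
            func.output_schema = schema
            return func
        return decorator


@outputSchema("pairs:{t:(m:map[])}")
def tuple_to_bag(maps):
    """Return one single-field tuple per map; an empty or null input gives an empty bag."""
    if maps is None:
        return []
    return [(m,) for m in maps if m is not None]


if __name__ == "__main__":
    # Sample row modeled on the ++JRGOCZQD entity from the question.
    row = ({"value": 50.0, "timestamp": 1388592317000},
           {"value": 25.0, "timestamp": 1388682237000})
    for (m,) in tuple_to_bag(row):
        print("%s %s" % (m["value"], m["timestamp"]))
```

Because an empty input yields an empty bag, flattening the result would be expected to drop entities that have no value/timestamp pairs, which matches the second output variant the question says is acceptable.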

answered Oct 02 '14 at 04:10

user2303197


Thanks for the answer! Unfortunately it doesn't quite work. In the "c" relation, only the first value/timestamp pair is retained. All value/timestamp pairs after the first one are lost. That is, this row is missing from the final output in relation "d" (assuming the input data is same as in the original question above): `++JRGOCZQD,timeseries,25.0,1388682237000` – Jesse Oct 02 '14 at 17:20
Can you confirm you have the exact `flatten` statement as above, especially the two periods? Those are needed to pick up any field after the first one. Also, what do you get when you dump `b`? – user2303197 Oct 02 '14 at 18:28
I edited the original question with what I get when I dump the various relations. I am using the .. in the `flatten` call. – Jesse Oct 02 '14 at 21:19
Can you post details on how you get to `a`? It works fine for me when I e.g. do this to a variable width tuple I get from `STRSPLIT`. Also, if there's a way to modify `a` such that you get a bag of maps, then you're essentially done. – user2303197 Oct 02 '14 at 21:44
I added the bits of the script that show where `a` comes from, and what its dump and describe output are. Hope that helps. – Jesse Oct 02 '14 at 22:07
Still mystified... what do you get for `c` if you remove the `flatten` operator (dump/describe)? I also added some other options to the answer above that can help you to get to a bag of maps (that's the main thing that's missing to make this work). – user2303197 Oct 02 '14 at 23:40
I added another relation in the question called c_no_flatten that shows what happens without the call to FLATTEN. It's clear that the call TOBAG($2 ..) is dropping fields beyond $2. I'll have a look at TransposeTupleToBag. If I can't get that to work I suppose I'll have to look at doing it with a UDF. – Jesse Oct 03 '14 at 17:12
