Pig Load problem with multiple delimiters

Question

I have some log lines like these:

Sep 10 12:00:01 10.100.2.28 t: |US,en,5,7350,100,0.076241,0.105342,-1,0,1,5,2,14,,,0,5134,7f378ecef7,fec81ebe-468a-4ac7-b472-8bd1ee88bfc2

Sep 10 12:00:01 10.100.2.28 t: |US,en,3,22427,100,0.05816,0.04018,-1,0,1,15,15,0,24383,cyclops.untd.com/,0,2796,2c5de71073,4858b748-121a-4f60-8087-97a8527d57c6

Sep 10 12:00:01 10.100.2.28 t: |us,en,6,16839,100,-1,-1,-1,17,1,0,-1,0,13819,d.tradex.openx.com/,0,-1,,4f805e3b-86b7-4dee-ae68-24e726cde954

Now, as is evident, there are two delimiters (comma and space). With the PigStorage function, I think I can only use one of them, which leaves me with a chararray that still contains the fields separated by the other delimiter (space or comma).

I want to access each member of that chararray but cannot. I have also tried TOKENIZE, but that gives a bag, and I don't think the items in a bag are ordered, so they cannot be accessed individually.

Monks, any help would be greatly appreciated.

Tanuj

score 2 · Answer 1 · answered Sep 14 '11 at 23:50

You can write your own custom user-defined load function that can handle the loading in any way you want. Usually, if your format is some sort of weird custom format, you are going to be stuck doing this. You can also get the nice feature of having your custom loader automatically name the columns.

Your other option would be to preprocess your data before it gets into Pig to be nicely delimited. I'm not sure how your data is set up or how it is coming in, so I'm not sure if this is possible. In general, a little data grooming and sanitization is never a bad thing.
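As a rough illustration of the preprocessing route, here is a minimal Python sketch that rewrites each log line as a single tab-delimited record before Pig sees it. The line layout in the regular expression is an assumption inferred from the three sample lines in the question, and `normalize_line` and `normalize_file` are hypothetical names, not part of Pig or any library.

```python
import re

# Assumed layout, inferred from the sample lines in the question:
#   "<Mon> <day> <HH:MM:SS> <host> t: |<comma-separated fields>"
LINE_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<tag>\S+) "
    r"\|(?P<payload>.*)$"
)


def normalize_line(line):
    """Return the line as one tab-delimited record, or None if it does not match."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    fields = [match.group("timestamp"), match.group("host")]
    # str.split keeps empty fields, so column positions stay stable.
    fields.extend(match.group("payload").split(","))
    return "\t".join(fields)


def normalize_file(src_path, dst_path):
    """Hypothetical helper: rewrite src_path into a tab-delimited dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            record = normalize_line(line)
            if record is not None:
                dst.write(record + "\n")
```

The output keeps the timestamp as one column and drops the `t:` tag and the leading `|`, so it could then be loaded with a single tab delimiter. Lines that do not match the assumed layout are skipped here; whether to skip or log them depends on your data.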

Donald Miner


Stumbled upon this somehow -- there's actually a rather viable third option, instead of implementing a whole, full-blown `LOAD` UDF, one can use streaming. Basically, load all the stuff as lines and stream through either a trivial [insert fav lang here] script or just downright *nix commands. This particular example would be easily solved by streaming through `tr ' ,' '\t'` and using the right schema. – TC1 Apr 05 '13 at 21:26
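For anyone who prefers a script over `tr`, the streaming idea from the comment above might look like the following Python sketch. It assumes the same mapping that GNU `tr ' ,' '\t'` performs (both space and comma become a tab); `retab` and `main` are hypothetical names chosen for illustration.

```python
import sys

# Same mapping as `tr ' ,' '\t'` under GNU tr, which pads the second set
# so that both the space and the comma become a tab.
TO_TABS = str.maketrans({" ": "\t", ",": "\t"})


def retab(line):
    """Replace every space and comma in the line with a tab."""
    return line.translate(TO_TABS)


def main(stream_in=sys.stdin, stream_out=sys.stdout):
    # A streaming script reads records on stdin and writes results to stdout.
    for line in stream_in:
        stream_out.write(retab(line))
```

Calling `main()` at the bottom of the script turns it into a filter that can be plugged into Pig's STREAM ... THROUGH operator. With the sample lines from the question, each record comes out with 24 tab-separated columns (5 from the syslog-style prefix, 19 from the comma-separated part), and the sixth column keeps its leading `|`, so the schema has to account for both.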

score 1 · Answer 2 · answered Apr 17 '13 at 05:36

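The simplest solution I can think of is to use the built-in PigStorage loader for one of the two delimiters, then STRSPLIT to handle the other.

Example (assuming there are 19 comma-separated fields, since that's what it looks like):

-- Splitting on spaces yields six columns; the timestamp alone takes three.
A = LOAD 'myData' USING PigStorage(' ') AS
    (mon:chararray, dayOfMonth:chararray, timeOfDay:chararray, host:chararray,
     tag:chararray, restOfCommaDelimitedFields:chararray);
-- STRSPLIT takes the string, a regex delimiter, and a limit on the number of pieces.
B = FOREACH A GENERATE mon, dayOfMonth, timeOfDay, host,
    FLATTEN(STRSPLIT(restOfCommaDelimitedFields, ',', 19)) AS
    (country,language,field3,field4...etc);

Note that this breaks if any of your comma-delimited fields contain spaces. Also, the first comma-delimited field keeps its leading '|' (for example '|US').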

score 0 · Answer 3 · answered Sep 12 '13 at 10:56

Write your own UDF; it will be the best way to solve your problem.

Konstantin Kudryavtsev
