Expected output is: (Hadoop definitive guide,Tom white,24.90).

I have tried the REGEX_EXTRACT() function, but with no luck so far. Can someone please help me out?

The input to my script is:

<CATALOG>
<BOOK>
<TITLE>Hadoop DEFINITIVE GUIDE</TITLE>
<AUTHOR>TOM WHITE</AUTHOR>
<COUNTRY>US</COUNTRY>
<COMPANY>CLOUDERA</COMPANY>
<PRICE>24.90</PRICE>
<YEAR>2012</YEAR>
</BOOK>
<BOOK>
<TITLE>Programming Pig</TITLE>
<AUTHOR>Alan Gates</AUTHOR>
<COUNTRY>USA</COUNTRY>
<COMPANY>Horton Works</COMPANY>
<PRICE>30.90</PRICE>
<YEAR>2013</YEAR>
</BOOK>
</CATALOG>
Manjunath Ballur
Mrudula

1 Answer

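Each XML element sits on its own input line, so a single record never holds all three fields. You will have to extract <TITLE>, <AUTHOR> and <PRICE> separately, number the rows of each result with the RANK operator, and then combine them on that number using the JOIN operator.

The following script achieves that: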

-- Load input 
A = LOAD '/input.txt' USING PigStorage() AS (f1:chararray);

-- Extract <TITLE>
B1 = FOREACH A GENERATE REGEX_EXTRACT(f1, '<TITLE>(.*)</TITLE>', 1) AS (title:chararray);
C1 = FILTER B1 BY title is not null;
D1 = RANK C1;

-- Extract <AUTHOR>
B2 = FOREACH A GENERATE REGEX_EXTRACT(f1, '<AUTHOR>(.*)</AUTHOR>', 1) AS (author:chararray);
C2 = FILTER B2 BY author is not null;
D2 = RANK C2;

-- Extract <PRICE>
B3 = FOREACH A GENERATE REGEX_EXTRACT(f1, '<PRICE>(.*)</PRICE>', 1) AS (price:chararray);
C3 = FILTER B3 BY price is not null;
D3 = RANK C3;

-- Join the 3 data sets on the rank column
D = JOIN D1 BY $0, D2 BY $0, D3 BY $0;

-- Eliminate the ranks ($0, $2 and $4), keeping only the extracted values
E = FOREACH D GENERATE $1 AS (title:chararray), $3 AS (author:chararray), $5 AS (price:chararray);

dump E;

For the input mentioned in the question, I got the following output:

(Hadoop DEFINITIVE GUIDE,TOM WHITE,24.90)
(Programming Pig,Alan Gates,30.90)
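
To make the rank-and-join idea concrete, here is a minimal Python sketch of the same logic outside Pig. It is only an illustration, not part of the Pig script: the names `extract_ranked` and `join_books` and the inline `SAMPLE` string are assumptions made for this example. Each field is pulled out line by line, the surviving rows are numbered in order of appearance (the role `RANK` plays above), and rows with the same number are joined.

```python
import re

# Sample input copied from the question.
SAMPLE = """<CATALOG>
<BOOK>
<TITLE>Hadoop DEFINITIVE GUIDE</TITLE>
<AUTHOR>TOM WHITE</AUTHOR>
<COUNTRY>US</COUNTRY>
<COMPANY>CLOUDERA</COMPANY>
<PRICE>24.90</PRICE>
<YEAR>2012</YEAR>
</BOOK>
<BOOK>
<TITLE>Programming Pig</TITLE>
<AUTHOR>Alan Gates</AUTHOR>
<COUNTRY>USA</COUNTRY>
<COMPANY>Horton Works</COMPANY>
<PRICE>30.90</PRICE>
<YEAR>2013</YEAR>
</BOOK>
</CATALOG>"""


def extract_ranked(lines, tag):
    """Return {rank: value} for every line containing <tag>...</tag>."""
    pattern = re.compile(r"<%s>(.*)</%s>" % (tag, tag))
    values = []
    for line in lines:
        match = pattern.search(line)
        if match:  # mirrors FILTER ... BY field is not null
            values.append(match.group(1))
    # Mirrors RANK with no sort field: number the rows 1, 2, 3, ...
    return {rank: value for rank, value in enumerate(values, start=1)}


def join_books(text):
    """Join title, author and price rows that share the same rank."""
    lines = text.splitlines()
    titles = extract_ranked(lines, "TITLE")
    authors = extract_ranked(lines, "AUTHOR")
    prices = extract_ranked(lines, "PRICE")
    # Mirrors JOIN ... BY $0: keep only ranks present in all three sets.
    common = sorted(titles.keys() & authors.keys() & prices.keys())
    return [(titles[r], authors[r], prices[r]) for r in common]


if __name__ == "__main__":
    for row in join_books(SAMPLE):
        print(row)
```

The sketch also shows the approach's main limitation: the join is purely positional, so it only lines up correctly when every book contains all three elements and they appear in the same order.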
Manjunath Ballur
  • OK, I'm able to extract the individual fields, but I'm not able to join the 3 data sets; I get a parsing error: `ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " " at line 1`. I'm also not able to execute the RANK command. By modifying the above commands a bit I can still extract the fields, but I can't join them. What am I doing wrong? Please help. – Mrudula Dec 27 '15 at 06:11
  • I extracted the individual fields with: `B = foreach A GENERATE FLATTEN(REGEX_EXTRACT(x,'(.*)',1)) AS (title:chararray);` – Mrudula Dec 27 '15 at 06:20
  • Which version of Pig are you using? My version of Pig is 0.14, and this script worked perfectly for me. I have even posted the output I got by running the script in my setup. Can you check `pig --version`? Your version of Pig probably does not support `RANK`. The `RANK` operator is supported from Pig 0.11 onwards. – Manjunath Ballur Dec 27 '15 at 06:22
  • I'm getting an identifier error for the RANK command: `Encountered C1, was expecting "as", ";"`. Is there any substitute for RANK? Because of it I then also get an error for the last command. – Mrudula Dec 27 '15 at 06:32
  • I asked you, what's your Pig version? Can you please tell me that? – Manjunath Ballur Dec 27 '15 at 06:33
  • Yes, right, I'm using version 0.8; this is what is used in my class. Is there any other way? The code works fine up to the JOIN command; the last one gives me an error. I have to submit it today by 4. – Mrudula Dec 27 '15 at 06:39
  • OK. So, that's the problem: the `RANK` operator is not supported in 0.8. Let me check whether I can modify the script. If I find an alternative, I will update the answer. – Manjunath Ballur Dec 27 '15 at 06:40
  • Hey, I got the answer: `B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,'\\s*(.*)\\s*(.*)[a-zA-Z\\s*\\S*]+(.*)[a-zA-Z\\s*\\S*]+')); Dump B` But I'm having difficulty understanding the `[a-zA-Z\\s*\\S*]+` part of the regex. – Mrudula Dec 29 '15 at 10:30
  • `[a-zA-Z\\s*\\S*]+` means one or more characters drawn from the set of letters, whitespace (`\\s`), non-whitespace (`\\S`) and a literal `*`. Since `\\s` and `\\S` together cover every character, it effectively matches any run of one or more characters, including newlines. Check the links: http://stackoverflow.com/questions/13750716/what-does-regular-expression-s-s-do and http://stackoverflow.com/questions/4377480/what-does-this-s-regex-mean-in-javascript – Manjunath Ballur Dec 29 '15 at 12:02
  • Please post your answer. It would be of help. – Manjunath Ballur Dec 29 '15 at 15:44
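
The character class discussed in the comments can be checked with a short Python sketch. This is an illustration under stated assumptions, not the asker's actual script: the tag names in `RECORD` are a guess based on the question's XML (the pattern as preserved in the comment shows no tags), and `consumes_all` and `extract_fields` are hypothetical helper names. Python's `re.fullmatch` is used because `REGEX_EXTRACT_ALL` requires the whole string to match.

```python
import re

# The class from the comment. Pig needs "\\s" inside its string literal;
# a Python raw string needs only "\s".
ANY_RUN = r"[a-zA-Z\s*\S*]+"

# ASSUMPTION: hypothetical tag names, guessed from the question's XML.
RECORD = (
    r"<BOOK>\s*<TITLE>(.*)</TITLE>\s*<AUTHOR>(.*)</AUTHOR>"
    + ANY_RUN
    + r"<PRICE>(.*)</PRICE>"
    + ANY_RUN
)

# One whole <BOOK> element treated as a single record.
BOOK = (
    "<BOOK>\n"
    "<TITLE>Hadoop DEFINITIVE GUIDE</TITLE>\n"
    "<AUTHOR>TOM WHITE</AUTHOR>\n"
    "<COUNTRY>US</COUNTRY>\n"
    "<COMPANY>CLOUDERA</COMPANY>\n"
    "<PRICE>24.90</PRICE>\n"
    "<YEAR>2012</YEAR>\n"
    "</BOOK>"
)


def consumes_all(text):
    """True if the character class alone matches the entire string."""
    return re.fullmatch(ANY_RUN, text) is not None


def extract_fields(record):
    """Return the three captured groups, or None if the record does not match."""
    match = re.fullmatch(RECORD, record)
    return match.groups() if match else None


if __name__ == "__main__":
    print(consumes_all("<COUNTRY>US</COUNTRY>\n<COMPANY>CLOUDERA</COMPANY>\n"))
    print(extract_fields(BOOK))
```

Because `\s` and `\S` together cover every character, the class behaves like `[\s\S]+`: it skips over whatever lies between the captured elements, newlines included, while each `(.*)` stays on one line because `.` does not match a newline by default.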