I am trying to read data from an XML file using Pig, but the output I get is incomplete.

Input file:

<document>   
<url>htp://www.abc.com/</url>
<category>Sports</category>
<usercount>120</usercount>
<reviews>    
<review>good site</review>
<review>This is Avg site</review>
<review>Bad site</review>
</reviews>
</document>

The code I am using is:

register 'Desktop/piggybank-0.11.0.jar';
A = load 'input3' using org.apache.pig.piggybank.storage.XMLLoader('document') as (data:chararray);


B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(data,'(?s)<document>.*?<url>([^>]*?)</url>.*?<category>([^>]*?)</category>.*?<usercount>([^>]*?)</usercount>.*?<reviews>.*?<review>\\s*([^>]*?)\\s*</review>.*?</reviews>.*?</document>')) as (url:chararray,category:chararray,usercount:int,review:chararray);

And the output I get is:

(htp://www.abc.com/,Sports,120,good site)

This is incomplete: only the first review is returned. Can someone please help me find what I am missing?

Sachin
  • Based on the regex, the output is correct. You need to add `reviews` to the regex to get all the `review` elements. In any case, regexes are not preferred for XML parsing (http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg). I would suggest using UDFs for it. – Abhishek May 05 '15 at 12:05
  • I tried using `reviews` too, but the output is still incomplete. – Sachin May 05 '15 at 12:11
  • You should add all the `review` tags separately. – Abhishek May 05 '15 at 12:13
  • Yes, that works. But what if I have plenty of them, say 1000 reviews? I can't afford to add 1000 review tags. – Sachin May 05 '15 at 12:15
  • Not sure about that. Let me give it a try and get back to you. I would still suggest using a UDF for XML :) – Abhishek May 05 '15 at 12:16
  • OK, thanks. Your help is much appreciated. I am still in search of a generic solution. – Sachin May 05 '15 at 12:17
  • Is it OK if the text of all the `review` elements is appended together in one line? – Abhishek May 05 '15 at 12:33
  • Yes, that will do. Comma-separated reviews. – Sachin May 05 '15 at 12:57
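
To see why only the first review comes back, here is a minimal sketch in Python. Python's `re` module stands in for the Java regex engine that Pig uses, which is an assumption made so the example can run on its own. It applies the question's pattern as a full-string match, which is how REGEX_EXTRACT_ALL is understood to behave, and then collects every review with a separate `findall`, joined by commas as agreed in the comments above. The names `DOC`, `SINGLE`, `first_only`, `all_reviews`, and `joined` are illustrative only.

```python
import re

# Sample record from the question, roughly as XMLLoader('document') would hand it over.
DOC = """<document>
<url>htp://www.abc.com/</url>
<category>Sports</category>
<usercount>120</usercount>
<reviews>
<review>good site</review>
<review>This is Avg site</review>
<review>Bad site</review>
</reviews>
</document>"""

# Same pattern as in the question: it contains exactly one capture group for <review>,
# so one match can yield at most one review, however many the document holds.
SINGLE = re.compile(
    r"(?s)<document>.*?<url>([^>]*?)</url>.*?<category>([^>]*?)</category>"
    r".*?<usercount>([^>]*?)</usercount>.*?<reviews>.*?<review>\s*([^>]*?)\s*"
    r"</review>.*?</reviews>.*?</document>"
)

# Full-string match, four groups: the lazy quantifiers stop at the first <review>.
first_only = SINGLE.fullmatch(DOC).groups()

# Collecting every review needs a repeated search, not a bigger single match.
all_reviews = re.findall(r"<review>\s*([^>]*?)\s*</review>", DOC)

# Comma-separated form, one line per document.
joined = ",".join(all_reviews)

print(first_only)
print(joined)
```

The point of the sketch is that the number of values a single match returns is fixed by the number of capture groups in the pattern, so a variable number of `review` elements has to be handled by a separate pass over the reviews.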

1 Answer

Finally got it working using CROSS. I'm using XPath here, but you can use regex if you prefer; I find the XPath way easier and cleaner than regex. Don't forget to replace testXML.xml with your own XML file.

XPath Way:

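-- Register your piggybank jar first (the question uses 'Desktop/piggybank-0.11.0.jar').
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
-- One row per <document>, holding the document-level fields.
A = LOAD 'testXML.xml' using org.apache.pig.piggybank.storage.XMLLoader('document') as (x:chararray);
B = FOREACH A GENERATE XPath(x, 'document/url'), XPath(x, 'document/category'), XPath(x, 'document/usercount');
-- One row per <review>, loaded from the same file.
C = LOAD 'testXML.xml' using org.apache.pig.piggybank.storage.XMLLoader('review') as (review:chararray);
D = FOREACH C GENERATE XPath(review,'review');
-- CROSS pairs every row of B with every row of D, so this assumes a single document in the input.
E = cross B,D;
dump E;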

Regex Way:

-- One row per <document>; the regex extracts only the document-level fields.
A = LOAD 'testXML.xml' using org.apache.pig.piggybank.storage.XMLLoader('document') as (x:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,'(?s)<document>.*?<url>([^>]*?)</url>.*?<category>([^>]*?)</category>.*?<usercount>([^>]*?)</usercount>.*?</document>')) as (url:chararray,category:chararray,usercount:int);
-- One row per <review>, loaded from the same file.
C = LOAD 'testXML.xml' using org.apache.pig.piggybank.storage.XMLLoader('review') as (review:chararray);
D = FOREACH C GENERATE FLATTEN(REGEX_EXTRACT_ALL(review,'<review>([^>]*?)</review>'));
-- CROSS pairs every row of B with every row of D, so this assumes a single document in the input.
E = cross B,D;
dump E;

Output:

(htp://www.abc.com/,Sports,120,Bad site)
(htp://www.abc.com/,Sports,120,This is Avg site)
(htp://www.abc.com/,Sports,120,good site)

Isn't that what you were expecting? ;)

Abhishek
  • Awesome, Abhishek! Sorry to say, but XPath isn't working for Pig version 0.8.0 :( ERROR 1070: Could not resolve org.apache.pig.piggybank.evaluation.xml.XPath using imports – Sachin May 05 '15 at 13:47
  • @Sachin OK, I have added the regex way to do it as well. I hope that resolves your problem. ;) – Abhishek May 05 '15 at 13:57
  • Can I ask you something related to this, a kind of extension, if you don't mind? – Sachin May 05 '15 at 14:08
  • What if I have multiple documents instead of one in the input? – Sachin May 05 '15 at 14:12
  • There are two ways: 1. You can load each document separately. 2. You can load the data into a Hive table using an XML SerDe and query the Hive table ;) – Abhishek May 05 '15 at 14:14
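
The multiple-documents question matters because a cross product has no notion of which review belongs to which document. The following Python sketch illustrates that with a made-up two-document input; it is not Pig code. The wrapping `<documents>` root, the second document, and all variable names are assumptions added for illustration. It compares a global cross product with pairing each document only with its own reviews.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical input: the first document is shortened from the question,
# the second document and the <documents> wrapper are invented for this sketch.
XML = """<documents>
<document>
<url>htp://www.abc.com/</url>
<category>Sports</category>
<usercount>120</usercount>
<reviews>
<review>good site</review>
<review>Bad site</review>
</reviews>
</document>
<document>
<url>http://www.example.com/</url>
<category>News</category>
<usercount>45</usercount>
<reviews>
<review>ok site</review>
</reviews>
</document>
</documents>"""

root = ET.fromstring(XML)
docs = root.findall("document")


def fields(doc):
    """Return the document-level fields as a tuple."""
    return (doc.findtext("url"), doc.findtext("category"), int(doc.findtext("usercount")))


# Global cross product: every document row combined with every review row,
# including reviews that belong to a different document.
doc_rows = [fields(d) for d in docs]
review_rows = [r.text for r in root.iter("review")]
crossed = [d + (r,) for d, r in itertools.product(doc_rows, review_rows)]

# Per-document pairing: each document is combined only with its own reviews.
paired = [fields(d) + (r.text,) for d in docs for r in d.findall("reviews/review")]

print(len(crossed), "rows from the global cross product")
print(len(paired), "rows when pairing per document")
```

With one document the two results coincide, which is why the CROSS approach in the answer gives the expected output for the sample file. With several documents, the reviews need to stay tied to their parent document in some way, for example by processing one document at a time as suggested above.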