I want to parse an XML file using Pig. Here is the XML:
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>
<amount>25</amount>
<tax>12</tax>
<total>37</total>
</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications with XML</description>
</book>
</catalog>
I am currently using XMLLoader to load the XML file and a regex to extract the fields.
Code:
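REGISTER piggybank.jar;
-- XMLLoader emits one chararray per <book>...</book> element
A = LOAD '/users/books.xml' USING org.apache.pig.piggybank.storage.XMLLoader('book') AS (x:chararray);
-- (?s) lets '.' match across the newlines inside each record;
-- group 1 captures the id attribute, group 2 the author text
B = FOREACH A GENERATE REGEX_EXTRACT_ALL(x, '(?s)<book.*?id="([^"]*)">.*?<author>([^<]*)</author>.*?</book>');
DUMP B;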
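To check the pattern outside Pig, here is a minimal Java sketch of the same extraction. It assumes that REGEX_EXTRACT_ALL relies on java.util.regex and requires the whole record to match, which is my understanding rather than something confirmed here. The class name BookRegexSketch and the method extract are hypothetical. The pattern closes the id capture group and adds (?s) so that '.' also matches the newlines inside a record.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for testing the regex on one <book> record outside Pig.
final class BookRegexSketch {
    // (?s) = DOTALL, so '.' matches newlines between the child elements.
    // Group 1: value of the id attribute. Group 2: text of <author>.
    static final Pattern BOOK = Pattern.compile(
        "(?s)<book.*?id=\"([^\"]*)\">.*?<author>([^<]*)</author>.*?</book>");

    // Returns {id, author}, or null when the record does not match as a whole.
    static String[] extract(String record) {
        Matcher m = BOOK.matcher(record);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String record = "<book id=\"bk101\">\n"
            + "<author>Gambardella, Matthew</author>\n"
            + "<title>XML Developer's Guide</title>\n"
            + "</book>";
        String[] fields = extract(record);
        System.out.println(fields[0] + " / " + fields[1]);
    }
}
```

Without the (?s) flag the same pattern fails on multi-line records, because '.' stops at line breaks by default.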
I would like to know whether there is another way to parse the XML, perhaps with a UDF. Is there an existing UDF for parsing XML, or how can I write one that serves my purpose? I am using Pig 0.12, and XPath is not working in this version.
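For illustration, the parsing logic such a UDF might wrap can be sketched with the XPath classes that ship with the JDK (javax.xml.xpath), which do not depend on the Pig version. This is only a sketch under assumptions: the class name BookXmlFields and the method extract are hypothetical, and the Pig wrapper is left out so the code compiles without the Pig jars. In an actual UDF this method would be called from exec(Tuple) of a class extending org.apache.pig.EvalFunc<String>, with the record emitted by XMLLoader as its input.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates one XPath expression against one XML record.
final class BookXmlFields {
    // Returns the string value of the first node matched by 'expression',
    // or an empty string when nothing matches.
    static String extract(String xml, String expression)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String record = "<book id=\"bk101\">"
            + "<author>Gambardella, Matthew</author>"
            + "<price><amount>25</amount><tax>12</tax><total>37</total></price>"
            + "</book>";
        System.out.println(extract(record, "/book/@id"));
        System.out.println(extract(record, "/book/author"));
        System.out.println(extract(record, "/book/price/total"));
    }
}
```

Compared with the regex, this handles nested elements such as price/total without extra capture groups, at the cost of parsing each record as a full XML document.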
Thanks in advance