
I want to parse an XML file using Pig. Please find the XML below:

<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
    <price>
       <amount>25</amount>
       <tax>12</tax>
       <total>37</total>
    </price>
    <publish_date>2000-10-01</publish_date>
    <description>An in-depth look at creating applications with XML</description>
  </book>
</catalog>

I am currently using XMLLoader to load the XML file and a regex to parse it.

Code:

REGISTER piggybank.jar;

A = LOAD '/users/books.xml' USING org.apache.pig.piggybank.storage.XMLLoader
('book') AS (x:chararray);

-- (?s) lets '.' match newlines; the first capture group is closed before the closing quote.
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x, '(?s)<book.*?id="([^"]*)".*?<author>([^<]*)</author>.*?</book>'));

DUMP B;

I want to understand whether there is any other way to parse XML, maybe using a UDF. Is there a UDF available to parse XML, or how can I create one to serve my purpose? I am using Pig version 0.12, and XPath is not working in this version.

Thanks in advance

1 Answer


If you're using regex, which you shouldn't, you probably also aren't too worried about speed, so just use lazy (.*?) quantifiers with dotall mode enabled, so that . also matches the newlines between elements:

 <book.*?id="(.*?)".*?<author>(.*?)<\/author>.*?<title>(.*?)<\/title>
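
As a rough illustration of how that pattern behaves in Java (the regex engine Pig runs on), here is a minimal sketch. The class and method names (BookRegexDemo, extract) are made up for this example, the sample record is a shortened copy of the one in the question, and the (?s) prefix is my addition to switch on dotall mode inside the pattern itself:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig or piggybank.
class BookRegexDemo {
    // The pattern from the answer; (?s) enables DOTALL so '.' also matches newlines.
    static final Pattern BOOK = Pattern.compile(
        "(?s)<book.*?id=\"(.*?)\".*?<author>(.*?)</author>.*?<title>(.*?)</title>");

    // Returns {id, author, title} for one <book> record, or null if nothing matches.
    static String[] extract(String bookXml) {
        Matcher m = BOOK.matcher(bookXml);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Shortened version of the record from the question, spread over several lines.
        String record =
            "<book id=\"bk101\">\n"
          + "  <author>Gambardella, Matthew</author>\n"
          + "  <title>XML Developer's Guide</title>\n"
          + "</book>";
        String[] fields = extract(record);
        System.out.println(String.join(" | ", fields));
    }
}
```

Without the (?s) flag the same pattern finds nothing in this record, because the elements sit on separate lines.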

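
Since regex is not really the right tool here, the sketch below shows what the parsing core of a custom UDF might look like using only the XML parser and XPath classes bundled with the JDK. The class and method names (BookXmlParser, parse) are hypothetical. To use it from Pig you would, I assume, wrap the call in a class extending org.apache.pig.EvalFunc and invoke it from exec on each record that XMLLoader emits; that wrapper is left out because it needs the Pig jar on the classpath.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Pig or piggybank.
class BookXmlParser {
    // Parses one <book>...</book> record and returns {id, author, title, total}.
    static String[] parse(String bookXml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(bookXml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // evaluate(String, Object) returns the string value of the first matching node.
        return new String[] {
            xpath.evaluate("/book/@id", doc),
            xpath.evaluate("/book/author", doc),
            xpath.evaluate("/book/title", doc),
            xpath.evaluate("/book/price/total", doc)
        };
    }

    public static void main(String[] args) throws Exception {
        // Shortened version of the record from the question.
        String record =
            "<book id=\"bk101\">\n"
          + "  <author>Gambardella, Matthew</author>\n"
          + "  <title>XML Developer's Guide</title>\n"
          + "  <price><amount>25</amount><tax>12</tax><total>37</total></price>\n"
          + "</book>";
        System.out.println(String.join(" | ", parse(record)));
    }
}
```

Nested elements such as price/total come out just as easily as top-level ones, which is where the regex approach tends to get fragile.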

Scott Weaver