
I am using Pig Latin to process a large XML dump, and I am trying to get the value of an XML node. The file looks like this:

<username>Shujaat</username>

I want to get the value Shujaat. I tried the piggybank XMLLoader, but it returns the XML tags along with their values. The code is:

register piggybank.jar;

A = load 'username.xml' using org.apache.pig.piggybank.storage.XMLLoader('username')
as (x: chararray);

B = foreach A generate x;

This code gives me the username tags as well as the values. I only need the values. Any idea how to do that? I came across regular expressions, but I don't know much about them. Thanks.

user7337271
shujaat

1 Answer


The example element you gave can be extracted with:

B = foreach A GENERATE REGEX_EXTRACT(x, '<username>(.*)</username>', 1) 
      AS name:chararray;

A nested element like this:

  <user>
    <id>456</id>
    <username>Taylor</username>
  </user>

can be extracted with something like this:

B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x, 
     '<user>\\n\\s*<id>(.*)</id>\\n\\s*<username>(.*)</username>\\n\\s*</user>')) 
     as (id: int, name:chararray);

 (456,Taylor)

You will almost certainly need a more sophisticated regex to cover all of your needs, e.g. one that handles empty elements, attributes, etc. Another option is to write a custom UDF that uses Java libraries to parse the XML content, so that you can avoid writing (over)complicated, error-prone regular expressions.
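As a rough illustration of the parser-based route, here is a minimal sketch in plain Java of the extraction logic such a UDF could delegate to. It uses only the JDK's built-in DOM parser; the class name XmlValueExtractor and the helper textOf are made up for this example, and the Pig side (wrapping the helper in a class that extends org.apache.pig.EvalFunc and registering the jar) is assumed and not shown.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Pig or piggybank.
class XmlValueExtractor {

    // Returns the trimmed text content of every <tag> element in the fragment.
    // The fragment must be well-formed XML with a single root element,
    // which is what XMLLoader hands over for each matched record.
    static List<String> textOf(String xmlFragment, String tag) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xmlFragment)));

        NodeList nodes = doc.getElementsByTagName(tag);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent().trim());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        String record = "<user>\n  <id>456</id>\n  <username>Taylor</username>\n</user>";
        System.out.println(textOf(record, "id"));        // prints [456]
        System.out.println(textOf(record, "username"));  // prints [Taylor]
    }
}
```

Because a real parser does the work, attributes, empty elements and varying whitespace do not require any change to the extraction code, which is the main advantage over the regex approach.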

Lorand Bendig
  • Thanks man. It worked. You rock. Will get back to you when writing the UDF using java. :) – shujaat Dec 03 '12 at 03:43
  • A warning for using regex to parse XML. See http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg – Alexander Torstling Oct 07 '13 at 10:55
  • @AlexanderTorstling You're right, using regex is not the preferred way to parse XML. That's why I mentioned that writing a UDF would definitely be a better approach – Lorand Bendig Oct 15 '13 at 10:40