How to parse XML string (not file) in PIG (without load)

Question

I haven't found a way to parse a string containing a whole XML document into separate tuples. How can I do this?

Suppose we have an Avro file with these fields:

{fieldname: id, fieldname: xml}

XML structure:

<?xml version='1.0' encoding='UTF-8'?>
<response>
    <name>Ghty</name>
    <main>
        <data>
            <id>1</id>
            <text>ABC mask</text>
            <title>Some text</title>
        </data>
        <data>
            <id>2</id>
            <text>Second value</text>
            <title>To</title>
        </data>
        <data>
            <id>3</id>
            <text>Evolving to</text>
            <title>Hint 567</title>
        </data>
    </main>
</response>

When we load from an XML file, it's clear that the input XML is split into parts according to the tag passed to the loader:

DEFINE XMLLoader org.apache.pig.piggybank.storage.XMLLoader('data');
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
xml = LOAD '$XPATH' using XMLLoader as (x:chararray);
DUMP xml;

(<data><id>1</id><text>ABC mask</text><title>Some text</title></data>)
(<data><id>2</id><text>Second value</text><title>To</title></data>)
(<data><id>3</id><text>Evolving to</text><title>Hint 567</title></data>)

xml_parse = FOREACH xml GENERATE
    XPath(x, 'data/id') as (id:chararray), 
    XPath(x, 'data/text') as (text:chararray), 
    XPath(x, 'data/title') as (title:chararray);

DUMP xml_parse;

(1,ABC mask,Some text)
(2,Second value,To)
(3,Evolving to,Hint 567)

I want to do the same with XML held in a string, without the LOAD operation. How can we do that when the XML documents sit in a string field and are not split up for further XPath processing?

(<?xml version='1.0' encoding='UTF-8'?><response><name>Ghty</name><main><data><id>1</id><text>ABC mask</text><title>Some text</title></data><data><id>2</id><text>Second value</text><title>To</title></data><data><id>3</id><text>Evolving to</text><title>Hint 567</title></data></main></response>)
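
To make the desired splitting concrete, here is a minimal sketch of what XMLLoader effectively does, applied to a string instead of a file. It is written in Python 3 rather than Pig Latin, so it only illustrates the logic a custom UDF would need. The names `XML_TEXT` and `split_records` are hypothetical and not part of Pig or piggybank.

```python
import xml.etree.ElementTree as ET

# The one-line document from the question, as it would sit in a string field.
XML_TEXT = (
    "<?xml version='1.0' encoding='UTF-8'?>"
    "<response><name>Ghty</name><main>"
    "<data><id>1</id><text>ABC mask</text><title>Some text</title></data>"
    "<data><id>2</id><text>Second value</text><title>To</title></data>"
    "<data><id>3</id><text>Evolving to</text><title>Hint 567</title></data>"
    "</main></response>"
)


def split_records(xml_text, tag="data"):
    """Return one serialized XML fragment per <tag> element in xml_text.

    Hypothetical helper: it mirrors the per-tag splitting seen with
    XMLLoader, but works on an in-memory string.
    """
    # Parse from bytes so the encoding declaration in the prolog is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    fragments = []
    for elem in root.iter(tag):
        # Drop whitespace that follows the element in pretty-printed input.
        elem.tail = None
        fragments.append(ET.tostring(elem, encoding="unicode"))
    return fragments


fragments = split_records(XML_TEXT)
for fragment in fragments:
    print(fragment)
```

Each returned fragment has the same shape as a row produced by XMLLoader, so the per-record XPath step could then be applied to it unchanged.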

1. I tried this approach without success, because I only get the first element from the XML string:

xml = LOAD 'xml_set.avro' using  
org.apache.pig.piggybank.storage.avro.AvroStorage();

xml_parse = foreach xml generate
    XPath($0, 'data');

DUMP xml_parse ;

(1,ABC mask,Some text)

2. I tried XPathAll, but that didn't work either, since all values were put into one tuple:

xml = LOAD 'xml_set.avro' using
org.apache.pig.piggybank.storage.avro.AvroStorage();

xml_parse = foreach xml generate
    XPathAll($0, 'data'),
    XPathAll($0, 'data'),
    XPathAll($0, 'data');
DUMP xml_parse ;

((1,ABC mask,Some text,2,Second value,To,3,Evolving to,Hint 567))

3. Then I tried XPathAll with full tag paths, but the result was a tuple of tuples. I need to split them up in the right order somehow, but I don't know how.

xml = LOAD 'xml_set.avro' using  
org.apache.pig.piggybank.storage.avro.AvroStorage();

xml_parse = foreach xml generate
    XPathAll($0, 'data/id'),
    XPathAll($0, 'data/text'),
    XPathAll($0, 'data/title');
DUMP xml_parse ;

((1,2,3),(ABC mask,Second value, Evolving to),(Some text,To,Hint 567))

It seems some kind of pivot is needed here. The goal is to get:

(1,ABC mask,Some text)
(2,Second value,To)
(3,Evolving to,Hint 567)
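
The pivot in question is a transpose of the three parallel tuples from attempt 3. As a sketch of the logic only (Python 3, not Pig Latin; `pivot` and `columns` are hypothetical names), it could look like this:

```python
def pivot(columns):
    """Transpose parallel per-field tuples into one tuple per record.

    The i-th value of every column is assumed to belong to the i-th
    <data> element.
    """
    # zip stops at the shortest column, so a missing child element in any
    # record would silently shift or drop values.
    return list(zip(*columns))


# Values as returned by the three XPathAll calls in attempt 3.
columns = (
    ("1", "2", "3"),
    ("ABC mask", "Second value", "Evolving to"),
    ("Some text", "To", "Hint 567"),
)

rows = pivot(columns)
for row in rows:
    print(row)
```

This only lines up correctly if every `<data>` element contains all three children. Otherwise the positions drift, which is one reason to split the document into per-record fragments first and extract fields from each fragment.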

Of course, I could write all the XML documents from the Avro file into one big XML file and then load it with XMLLoader, but I assume that is a redundant step.

I'd appreciate any help or suggestions. I've been stuck on this for a long time.

https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — kirkpatt, Jul 13 '17 at 22:10

0 Answers