I haven't found a way to parse a string containing a whole XML document into separate tuples. Please suggest how I can do this.
Suppose we have an Avro file with this schema:
{fieldname: id, fieldname: xml}
The XML structure stored in the xml field:
<?xml version='1.0' encoding='UTF-8'?>
<response>
<name>Ghty</name>
<main>
<data>
<id>1</id>
<text>ABC mask</text>
<title>Some text</title>
</data>
<data>
<id>2</id>
<text>Second value</text>
<title>To</title>
</data>
<data>
<id>3</id>
<text>Evolving to</text>
<title>Hint 567</title>
</data>
</main>
</response>
When we load from an XML file, it's clear that the input XML is split into parts according to the tag we pass to the loader:
DEFINE XMLLoader org.apache.pig.piggybank.storage.XMLLoader('data');
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
xml = LOAD '$XPATH' using XMLLoader as (x:chararray);
DUMP xml;
(<data><id>1</id><text>ABC mask</text><title>Some text</title></data>)
(<data><id>2</id><text>Second value</text><title>To</title></data>)
(<data><id>3</id><text>Evolving to</text><title>Hint 567</title></data>)
xml_parse = FOREACH xml GENERATE
XPath(x, 'data/id') as (id:chararray),
XPath(x, 'data/text') as (text:chararray),
XPath(x, 'data/title') as (title:chararray);
DUMP xml_parse;
(1,ABC mask,Some text)
(2,Second value,To)
(3,Evolving to,Hint 567)
I want to do the same with XML held in a string, without the LOAD operation. How can we do that when the XML documents are stored as strings and have not been split into parts for further XPath processing?
(<?xml version='1.0' encoding='UTF-8'?><response><name>Ghty</name><main><data><id>1</id><text>ABC mask</text><title>Some text</title></data><data><id>2</id><text>Second value</text><title>To</title></data><data><id>3</id><text>Evolving to</text><title>Hint 567</title></data></main></response>)
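To make the desired splitting concrete, here is a minimal sketch in plain Python of what "one tuple per <data> element" means for a string like the one above. The function name split_data_rows is hypothetical, and wiring it into Pig (for example as a Python UDF that returns a bag which is then flattened) is an assumption that is not shown or tested here.

```python
import xml.etree.ElementTree as ET


def split_data_rows(xml_string):
    """Return one (id, text, title) tuple per <data> element in the document."""
    # Encode to bytes so the parser honours the declared UTF-8 encoding.
    root = ET.fromstring(xml_string.encode("utf-8"))
    rows = []
    # iter("data") walks the whole tree, so the nesting under <main> does not matter.
    for data in root.iter("data"):
        # findtext returns None when a child tag is missing.
        rows.append((data.findtext("id"),
                     data.findtext("text"),
                     data.findtext("title")))
    return rows
```

Because each row is built from a single <data> element, the three values stay aligned even if one element lacks a tag (the missing value becomes None).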
1. I tried this approach without success, because I only get the first element from the XML string:
xml = LOAD 'xml_set.avro' using
org.apache.pig.piggybank.storage.avro.AvroStorage();
xml_parse = foreach xml generate
XPath($0, 'data');
DUMP xml_parse ;
(1,ABC mask,Some text)
2. I tried XPathAll, also without success: all values were put into one tuple:
xml = LOAD 'xml_set.avro' using
org.apache.pig.piggybank.storage.avro.AvroStorage();
xml_parse = foreach xml generate
XPathAll($0, 'data'),
XPathAll($0, 'data'),
XPathAll($0, 'data');
DUMP xml_parse ;
((1,ABC mask,Some text,2,Second value,To,3,Evolving to,Hint 567))
3. Then I tried XPathAll with full tag paths, but the result was a tuple of tuples. I need to split them into rows in the right order somehow, but I don't know how.
xml = LOAD 'xml_set.avro' using
org.apache.pig.piggybank.storage.avro.AvroStorage();
xml_parse = foreach xml generate
XPathAll($0, 'data/id'),
XPathAll($0, 'data/text'),
XPathAll($0, 'data/title');
DUMP xml_parse ;
((1,2,3),(ABC mask,Second value, Evolving to),(Some text,To,Hint 567))
It seems some kind of pivot is needed here. The goal is to get:
(1,ABC mask,Some text)
(2,Second value,To)
(3,Evolving to,Hint 567)
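The pivot I have in mind is a plain transpose of the three column tuples from attempt 3. A minimal sketch in Python, assuming the columns arrive as equal-length sequences; the name pivot_columns is hypothetical, and calling it from Pig would need a UDF wrapper that is not shown here.

```python
def pivot_columns(columns):
    """Transpose column tuples into row tuples.

    ((1, 2, 3), (a, b, c), (x, y, z)) -> [(1, a, x), (2, b, y), (3, c, z)]
    """
    # zip(*columns) pairs up the i-th value of every column.
    # Note: zip stops at the shortest column, so a <data> element with a
    # missing tag would silently shift or drop values.
    return [tuple(row) for row in zip(*columns)]
```

This only gives correct rows when every <data> element contains all three tags, which holds for the sample document but is an assumption in general.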
Of course, I could write all the XML documents from the Avro file into one big XML file and then load it with XMLLoader, but I assume that is a redundant step.
I'd appreciate any help and suggestions. I've been stuck on this for a long time.