Multiple occurance issue in XML Parsing using hive

Question

I'm trying to create an Hive table out of XML file using com.ibm.spss.hive.serde2.xml.XmlSerDe. This works quite good for the tags with single occurrence. But I've an issue with multiple occurrence.

Below is my source XML.

<Item>
      <TimeStamp>2016-02-19T12:27:06.387Z</TimeStamp>
      <AlsoSeen End="2014-08-21T13:44:32.557Z" Start="2014-08-21T13:44:04.637Z" />
      <AlsoSeen End="2014-08-21T13:44:33.557Z" Start="2014-08-21T13:45:04.637Z" />
      <AlsoSeen End="2014-08-21T13:44:34.557Z" Start="2014-08-21T13:46:04.637Z" />
      <Title ID="112031424">FAULT IN OUR STARS, THE</Title>
      <FileName>The Fault in Our Stars (2014) EXTENDED HDRip x264 AAC-CPG</FileName>
</Item>

Below is my Hive Table DDL

add jar hivexmlserde-1.0.5.3.jar;

CREATE EXTERNAL TABLE xml_test
(
         Item_TimeStamp String
         ,Item_AS_Start  String
         ,Item_AS_End    String
         ,Item_Title     String
         ,Item_ID        String
         ,Item_Artist    String
         ,Item_Author    String
         ,Item_FileName  String
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.Item_TimeStamp"="/Item/TimeStamp/text()",
"column.xpath.Item_AS_Start"="/Item/AlsoSeen/@Start",
"column.xpath.Item_AS_End"="/Item/AlsoSeen/@End",  
"column.xpath.Item_Title"="/Item/Title/text()",   
"column.xpath.Item_ID"="/Item/Title/@ID",      
"column.xpath.Item_FileName"="/Item/FileName/text()"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/xxxxxx/XML_text'
TBLPROPERTIES (
"xmlinput.start"="<Item>",
"xmlinput.end"="</Item>"
);

When I query the table I get the below value

2016-02-19T12:27:06.387Z <string>2014-08-21T13:44:04.637Z2014-08-21T13:45:04.637Z2014-08-21T13:46:04.637Z</string> <string>2014-08-21T13:44:32.557Z2014-08-21T13:44:33.557Z2014-08-21T13:44:34.557Z</string> FAULT IN OUR STARS, THE 112031424 The Fault in Our Stars (2014) EXTENDED HDRip x264 AAC-CPG

But I expected output should be 3 different lines for 3 difference occurances of AlsoSeen tag as below

2016-02-19T12:27:06.387Z  2014-08-21T13:44:04.637Z 2014-08-21T13:44:32.557Z FAULT IN OUR STARS, THE 112031424 The Fault in Our Stars (2014) EXTENDED HDRip x264 AAC-CPG
2016-02-19T12:27:06.387Z  2014-08-21T13:45:04.637Z 2014-08-21T13:44:33.557Z FAULT IN OUR STARS, THE 112031424 The Fault in Our Stars (2014) EXTENDED HDRip x264 AAC-CPG  
2016-02-19T12:27:06.387Z  2014-08-21T13:46:04.637Z 2014-08-21T13:44:34.557Z FAULT IN OUR STARS, THE 112031424 The Fault in Our Stars (2014) EXTENDED HDRip x264 AAC-CPG

Can anyone please help me?

I don't know anythig about that SerDe but the expressions sure look like XPath; so `/Item/AlsoSeen[2]/@Start` would get you the attrib of 2nd occurrence of element `` inside each element ``. But AFAIK you cannot iterate with XPath, you would need Java code -- or an XSLT stylesheet. — Samson Scharfrichter, May 25 '16 at 19:22
BTW, from the results, I guess the SerDe finds the "most granular common element" for all XPath expressions, assumes it means **one record**, and starts with spliiting the source XML in "records" before feeding each XPath expression into a column (i.e. a cell). — Samson Scharfrichter, May 25 '16 at 19:25
So I would advise you to **transform that XML before loading it into Hadoop**. With Linux utility `xsltproc` and a simple XSLT script, you could get a CSV with 1 column for all your "start" attributes separated by a space *(for a custom separator you would need an XSLT 2-compliant parser i.e. Saxon)* -- then you could read the CSV in Hive and "explode" the array into multiple rows. Trickier with multiple arrays but can be done with "indexed explode", and filtering out the cartesian mess by selecting only Start and End values with matching indexes. — Samson Scharfrichter, May 25 '16 at 19:31
@SamsonScharfrichter You're right, it's Xpath. Xpath supports wildcards like `*` , `@*` and `node()` which will match multiple values. SerDe is a Java library convention for HIVE that SERializes and DEserializes data, and there are SerDe for many file formats. Hive can use either 'delimited by' or SerDe to define file formats. In this case, the XML SerDe will do the iterating. The source code for this is here: https://github.com/dvasilen/Hive-XML-SerDe . I am suprised that this example worked though, see my answer below. — Davos, Apr 30 '17 at 14:42

score 1 · Answer 1 · answered Apr 30 '17 at 14:34

The documentation has some examples for this. https://github.com/dvasilen/Hive-XML-SerDe/wiki/XML-data-sources

Using text() or @Elementname gives you single (primitive) values, and also, your table has been declared as all STRING columns, you may need complex types to hold multiple values, i.e. Map, Struct or Array.

The behaviour you are seeing relates to (5. Complex Types) in the documentation in the link above.

"Complex content being used as a primitive type will be converted to a valid XML string by adding a root element called <string>"

In your example, this result you are seeing is the 3 values for the @Start attribute concatenated and then wrapped with the <string> tag:

<string>2014-08-21T13:44:04.637Z2014-08-21T13:45:04.637Z2014-08-21T13:46:04.637Z</string>

Your XPath is using @Start and @End, not the /* pattern though, so I am surprised it worked at all.

I understand the results you want, but I am not sure that it is really matching the way the data is modeled. I think that <Item> is effectively a single row. Item has a <TimeStamp>, and a <Title> which has an @ID attribute, and it also has a property called <AlsoSeen> which can have multiple values for @End & @Start. A straight mapping from XML to Hive table would probably be this complex datatype: ARRAY<STRUCT<End: TIMESTAMP, Start: TIMESTAMP>> , an Array of Structs containing End and Start timestamps.

Unfortunately your data doesn't conform to the String format that Hive expects for timestamps, so you can just use STRING, as per https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-timestamp

So you would end up with this table create statement (I've removed the Item_Artist and Item_Author in your original create table because you had no matching xpath for those columns in the SERDEPROPERTIES )

CREATE EXTERNAL TABLE xml_test
(
         Item_TimeStamp String
         ,Item_AlsoSeen ARRAY<STRUCT<End: STRING, Start: STRING>>
         ,Item_Title     String
         ,Item_ID        String
         ,Item_FileName  String
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.Item_TimeStamp"="/Item/TimeStamp/text()",
"column.xpath.Item_AlsoSeen"="/Item/AlsoSeen", 
"column.xpath.Item_Title"="/Item/Title/text()",   
"column.xpath.Item_ID"="/Item/Title/@ID",      
"column.xpath.Item_FileName"="/Item/FileName/text()"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/xxxxxx/XML_text'
TBLPROPERTIES (
"xmlinput.start"="<Item>",
"xmlinput.end"="</Item>"
);

Note the xpath I used for the AlsoSeen field: "/Item/AlsoSeen"

This is the pattern for Arrays and Structs as in the documentation linked above.

It will give you results like

+--------------------+--------------------+--------------------+---------+--------------------+
|      Item_TimeStamp|       Item_AlsoSeen|          Item_Title|  Item_ID|       Item_FileName|
+--------------------+--------------------+--------------------+---------+--------------------+
|2016-02-19T12:27:...|[[2014-08-21T13:4...|FAULT IN OUR STAR...|112031424|The Fault in Our ...|
+--------------------+--------------------+--------------------+---------+--------------------+

The Item_AlsoSeen column contains this, an Array () of Structs [] separated by commas, each struct has the end and the start.

WrappedArray([2014-08-21T13:44:32.557Z,2014-08-21T13:44:04.637Z], [2014-08-21T13:44:33.557Z,2014-08-21T13:45:04.637Z], [2014-08-21T13:44:34.557Z,2014-08-21T13:46:04.637Z])

From there, you can query the result set you wanted by using the LATERAL VIEW explode() technique

e.g.

SELECT
Item_TimeStamp,
    --Item_AlsoSeen,
    ias.End,
    ias.Start,
    Item_Title, 
    Item_ID, 
    Item_FileName
FROM xml_test
    LATERAL VIEW explode(Item_AlsoSeen) ias as ias

Note that because I created AlsoSeen as an Array of Struct, the explode() function returns each struct as a row. LATERAL VIEW then effectively does a cross join or probably better called a CROSS APPLY to get a cartesian product.

+--------------------+--------------------+--------------------+--------------------+---------+--------------------+
|      Item_TimeStamp|                 End|               Start|          Item_Title|  Item_ID|       Item_FileName|
+--------------------+--------------------+--------------------+--------------------+---------+--------------------+
|2016-02-19T12:27:...|2014-08-21T13:44:...|2014-08-21T13:44:...|FAULT IN OUR STAR...|112031424|The Fault in Our ...|
|2016-02-19T12:27:...|2014-08-21T13:44:...|2014-08-21T13:45:...|FAULT IN OUR STAR...|112031424|The Fault in Our ...|
|2016-02-19T12:27:...|2014-08-21T13:44:...|2014-08-21T13:46:...|FAULT IN OUR STAR...|112031424|The Fault in Our ...|
+--------------------+--------------------+--------------------+--------------------+---------+--------------------+

Ref Hive Datatypes: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

score 0 · Answer 2 · edited May 23 '17 at 12:02

0

There are custom written UDF's which are array_index , numeric_range solves the problem easily. In order to use the functions the column type should be an Array. Please refer the post Hive Explode / Lateral View multiple arrays

edited May 23 '17 at 12:02

Community

1
1

answered May 26 '16 at 13:08

Sathyaraj

189
6
21

Multiple occurance issue in XML Parsing using hive

2 Answers2