8

We have XML data loaded into Hadoop as a single string column named XML. We are trying to retrieve to levels of data an normalize or explode it into single rows for processing (you know, like a table!) Have tried explode function, but not getting exactly what we want.

SAMPLE XML

<Reports>
<Report ID="1">
<Locations>
  <Location ID="20001">
    <LocationName>Irvine Animal Shelter</LocationName>
  </Location>   
  <Location ID="20002">
    <LocationName>Irvine City Hall</LocationName>
  </Location>   
  </Locations>
</Report>
<Report ID="2">
<Locations>
  <Location ID="10001">
    <LocationName>California Fish Grill</LocationName>
  </Location>   
  <Location ID="10002">
    <LocationName>Fukada</LocationName>
  </Location>   
  </Locations>
</Report>
</Reports>

QUERY 1

We are querying the higher level Report.Id and then the id and name from the child (Locations/Location). The following gives us essentially a cartesian product of all possible combinations (in this example, 8 rows instead of the 4 we are hoping for.)

SELECT xpath_int(xml, '/Reports/Report/@ID') AS id, location_id, location_name 
FROM   xmlreports 
LATERAL VIEW explode(xpath(xml, '/Reports/Report/Locations/Location/@ID')) myTable1 AS location_id 
LATERAL VIEW explode(xpath(xml, '/Reports/Report/Locations/Location/LocationName/text()')) myTable2 AS location_name;

QUERY 2

Tried to group into a structure and then explode, but this returns two rows and two arrays.

SELECT id, loc.col1, loc.col2
FROM (
SELECT xpath_int(xml, '/Reports/Report/@ID') AS id, 
       array(struct(xpath(xml, '/Reports/Report/Locations/Location/@ID'), xpath(xml,     '/Reports/Report/Locations/Location/LocationName/text()'))) As foo
FROM   xmlreports) x
LATERAL VIEW explode(foo) exploded_table as loc;

RESULTS

1   ["20001","20002"]       ["Irvine Animal Shelter","Irvine City Hall"]
2   ["10001","10002"]       ["California Fish Grill","Irvine Spectrum"]

WHAT WE WANT IS

1   "20001" "Irvine Animal Shelter"
1   "20002" "Irvine City Hall"
2   "10001" "California Fish Grill"
2   "10002" "Irvine Spectrum"

Seems like a common thing to want to do, but can't find any examples. Any help is greatly appreciated.

Jim Barnett
  • 81
  • 1
  • 1
  • 2
  • I think your case is similar to what is described [here](http://stackoverflow.com/q/11373543/203856) (which boils down, given your first attempt, to `explode` only once instead of twice). I hope this helps. – larsen Mar 13 '13 at 09:11
  • Did read that other post, and came up with query 2 to try that out. Since that post dealt with exploding items that are and array of structs. However, that gave a totally unexpected result of two arrays. – Jim Barnett Mar 13 '13 at 16:48
  • Update, Have used Python to do this externally, but still hoping someone can come up with a solution using only Hive Query Language. – Jim Barnett Mar 26 '13 at 17:21
  • jim. would tell how create your xmltable, can put the whole create syntax? – user2539175 Jul 01 '13 at 13:48

1 Answers1

3

I see two ways to solve this problem.

  1. Create custom UDF which will parse one XML element and return array that you need. After that explode array.

  2. Use subselects.

I implemented solution 2 using subselects. Even when using subselects Hive is "smart enough" to create only one map-reduce job for this, so I think you will not have performance problems.

SELECT 
 l2.key,
 l2.rid,
 l2.location_id,
 location_name
FROM (
 SELECT 
  l1.key as key,  
  l1.rid as rid, 
  location_id as location_id,
  l1.xml as xml
 FROM (
   SELECT key, xml, rid
   FROM xmlreports
   LATERAL VIEW explode(xpath(xml, '/Reports/Report/@ID')) rids as rid
 ) l1
 LATERAL VIEW explode(xpath(l1.xml, concat('/Reports/Report[@ID = ',l1.rid, ']/Locations/Location/@ID'))) locids as location_id
) l2
LATERAL VIEW explode(xpath(l2.xml, concat('/Reports/Report[@ID = ',l2.rid, ']/Locations/Location[@ID = ', l2.location_id ,' ]/LocationName/text()'))) locnames as location_name;

After running this query on XML file you provided I got results that you are searching for

1   1   20001   Irvine Animal Shelter
1   1   20002   Irvine City Hall
1   2   10001   California Fish Grill
1   2   10002   Fukada

Hope this solves your problem.

Regards, Dino

dino.keco
  • 1,401
  • 1
  • 12
  • 18