5

I'm trying to load XML data into Hive but I'm getting an error :

java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"xmldata":""}

The xml file i have used is :

<?xml version="1.0" encoding="UTF-8"?>
<catalog>
<book>
  <id>11</id>
  <genre>Computer</genre>
  <price>44</price>
</book>
<book>
  <id>44</id>
  <genre>Fantasy</genre>
  <price>5</price>
</book>
</catalog>

The hive query i have used is :

1) Create TABLE xmltable(xmldata string) STORED AS TEXTFILE;
LOAD DATA lOCAL INPATH '/home/user/xmlfile.xml' OVERWRITE INTO TABLE xmltable;

2) CREATE VIEW xmlview (id,genre,price)
AS SELECT
xpath(xmldata, '/catalog[1]/book[1]/id'),
xpath(xmldata, '/catalog[1]/book[1]/genre'),
xpath(xmldata, '/catalog[1]/book[1]/price')
FROM xmltable;

3) CREATE TABLE xmlfinal AS SELECT * FROM xmlview;

4) SELECT * FROM xmlfinal WHERE id ='11

Till 2nd query everything is fine but when i executed the 3rd query it's giving me error:

The error is as below:

java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"xmldata":"<?xml version=\"1.0\" encoding=\"UTF-8\"?>"}
    at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:159)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error    while processing row {"xmldata":"<?xml version=\"1.0\" encoding=\"UTF-8\"?>"}
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:675)
    at org.apache.hadoop.hive.ql.exec

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

So where it's going wrong? Also I'm using the proper xml file.

Thanks, Shree

AkshayP
  • 2,141
  • 2
  • 18
  • 27
shree11
  • 535
  • 4
  • 13
  • 26

6 Answers6

4

Reason for error :

1) case-1 : (your case) - xml content is being fed to hive as line by line.

input xml:

<?xml version="1.0" encoding="UTF-8"?>
<catalog>
<book>
  <id>11</id>
  <genre>Computer</genre>
  <price>44</price>
</book>
<book>
  <id>44</id>
  <genre>Fantasy</genre>
  <price>5</price>
</book>
</catalog>  

check in hive :

select count(*) from xmltable;  // return 13 rows - means each line in individual row with col xmldata  

Reason for err :

XML is being read as 13 pieces not at unified. so invalid XML

2) case-2 : xml content should be fed to hive as singleString - XpathUDFs works refer syntax : All functions follow the form: xpath_(xml_string, xpath_expression_string).* source

input.xml

<?xml version="1.0" encoding="UTF-8"?><catalog><book><id>11</id><genre>Computer</genre><price>44</price></book><book><id>44</id><genre>Fantasy</genre><price>5</price></book></catalog>

check in hive:

select count(*) from xmltable; // returns 1 row - XML is properly read as complete XML.

Means :

xmldata   = <?xml version="1.0" encoding="UTF-8"?><catalog><book> ...... </catalog>

then apply your xpathUDF like this

select xpath(xmldata, 'xpath_expression_string' ) from xmltable
vijay kumar
  • 2,049
  • 1
  • 15
  • 18
  • Hi you were right that i should fed xmldata as a single string and now i'm able to create xmlview without any error. But i'm not getting the proper result. I used the same queries which i posted above. In that when i fired 4rd query i.e., `SELECT * FROM xmlfinal` getting the result as `[] [] [] ` – shree11 Jul 09 '14 at 05:47
  • Alos if i use `xpath_string` instead of `xpath` , getting only the 1st row as output i.e., `11 Computer 44`. But i want both rows to be returned as result. Why `XPATH` is not returning any result ? – shree11 Jul 09 '14 at 06:00
  • you want output like this ? row:1 11 computer 44 , row:2 44 fantacy five – vijay kumar Jul 09 '14 at 09:22
  • Yes, i want the output as row1 , row2.. `11 Computer 44` `44 fantasy 5` . But how can i achieve it? – shree11 Jul 09 '14 at 13:12
  • Hi vijay, i have one doubt. As u suggested to ADD jar. I din't get that? do i need to create jar or do i need to add existing jar? i din't get that part. Can you please tell me how can i do it? Also where i need to add that jar(location)? – shree11 Jul 09 '14 at 13:22
  • download - https://github.com/klout/brickhouse/archive/master.zip ,unzip , cd brickhouse ,mvn package,it will create brickhouse-0.7.0-SNAPSHOT.jar in brickhouse/target/brickhouse-0.7.0-SNAPSHOT.jar, place this jar in hive machine , i placed in /home/vijay/ , Add jar to hive terminal as in answer. – vijay kumar Jul 09 '14 at 13:54
4

Find Jar here -- > Brickhouse ,

sample example here --> Example

similar example in stackoverflow - here

Solution:

--Load xml data to table
DROP table xmltable;
Create TABLE xmltable(xmldata string) STORED AS TEXTFILE;
LOAD DATA lOCAL INPATH '/home/vijay/data-input.xml' OVERWRITE INTO TABLE xmltable;

-- check contents
SELECT * from xmltable;

-- create view
Drop view  MyxmlView;
CREATE VIEW MyxmlView(id, genre, price) AS
SELECT
 xpath(xmldata, 'catalog/book/id/text()'),
 xpath(xmldata, 'catalog/book/genre/text()'),
 xpath(xmldata, 'catalog/book/price/text()')
FROM xmltable;

-- check view
SELECT id, genre,price FROM MyxmlView;


ADD jar /home/vijay/brickhouse-0.7.0-SNAPSHOT.jar;  --Add brickhouse jar 

CREATE TEMPORARY FUNCTION array_index AS 'brickhouse.udf.collect.ArrayIndexUDF';
CREATE TEMPORARY FUNCTION numeric_range AS 'brickhouse.udf.collect.NumericRange';

SELECT 
   array_index( id, n ) as my_id,
   array_index( genre, n ) as my_genre,
   array_index( price, n ) as my_price
from MyxmlView
lateral view numeric_range( size( id )) MyxmlView as n;

Output:

hive > SELECT
     >    array_index( id, n ) as my_id,
     >    array_index( genre, n ) as my_genre,
     >    array_index( price, n ) as my_price
     > from MyxmlView
     > lateral view numeric_range( size( id )) MyxmlView as n;
Automatically selecting local only mode for query
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Execution log at: /tmp/vijay/.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2014-07-09 05:36:45,220 null map = 0%,  reduce = 0%
2014-07-09 05:36:48,226 null map = 100%,  reduce = 0%
Ended Job = job_local_0001
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
my_id      my_genre      my_price
11      Computer        44
44      Fantasy 5

Time taken: 8.541 seconds, Fetched: 2 row(s)

Adding-more-info as requested by Question owner:

enter image description here enter image description here

Community
  • 1
  • 1
vijay kumar
  • 2,049
  • 1
  • 15
  • 18
  • i added the snapshot jar you specified and when i run the code `SELECT array_index( id, n ) as my_id, array_index( genre, n ) as my_genre, array_index( price, n ) as my_price from xmlView lateral view numeric_range( size( id )) xmlView as n;` getting an error below. `FAILED: SemanticException [Error 10016]: Line 6:34 Argument type mismatch 'id': "map" or "list" is expected at function SIZE, but "int" is found`. And while creating the VIEW i have used `XPATH_int `for id and `XPATH_string` for genre & price. Also tried with giving only 'XPATH' but still same error exits. – shree11 Jul 16 '14 at 05:10
  • use XPATH as i used, cause XPath return array , while Xpath_String retruns only one string. create view as i shown in answer. – vijay kumar Jul 16 '14 at 05:13
  • official hive doc says : The xpath() function always returns a hive array of strings . The xpath_string() function returns the text of the first matching node - but we need for all books , not one. First of all make sure that MyxmlView is fine. can you paste result of : SELECT id, genre,price FROM MyxmlView; – vijay kumar Jul 16 '14 at 05:16
  • I used`XPATH` and when i do SELECT id,genre,price from xmlview; Getting the output as `[] [] []`. And after adding jar and creating 2 temporary functions, when i execute SELECT array_index( id, n ) as my_id, array_index( genre, n ) as my_genre, array_index( price, n ) as my_price from xmlView lateral view numeric_range( size( id )) xmlView as n;. The query is excecuting properly without giving any output as `Job 0: Map: 1 Cumulative CPU: 2.86 sec HDFS Read: 402 HDFS Write: 0 SUCCESS Total MapReduce CPU Time Spent: 2 seconds 860 msec OK Time taken: 19.751 seconds` – shree11 Jul 16 '14 at 05:49
  • plz check the snapshot. seems something wrong in your data loading part or view – vijay kumar Jul 16 '14 at 06:20
  • i'm not getting what's going wrong.Even i checked with the loading part as well. – shree11 Jul 17 '14 at 04:30
  • SELECT * from xmltable; is also empty ? – vijay kumar Jul 17 '14 at 05:08
  • No, it's not empty. It shows complete xml file i have loaded as below:`hive> select * from xmltable; OK 11Computer4444Fantasy5 Time taken: 2.385 seconds` – shree11 Jul 17 '14 at 05:35
  • problem with your view it seems! – vijay kumar Jul 17 '14 at 10:25
  • May be the problem is with view, because if i use only XPATH in a view as u suggested and when i fire a query `SELECT * FROM id,genre,price from xmlview`, i ll get the blank output as below : `[] [] []`. – shree11 Jul 17 '14 at 10:33
  • @mamisetty, The screenshots you attached were helped me a lot. Now i'm able to read the XML data in hive. The problem was with view creation .I was using `/catalog[1]/book[1]/id` and when i changed to `/catalog/book/id/text()` for all field solved the problem. Thanks for your support and also listening to all my queries patiently. Thanks a lot. – shree11 Jul 17 '14 at 11:45
  • cool. plz accept if useful. Note that when using XpathUDF validate'xpath_expression_string'and then jump into next steps – vijay kumar Jul 17 '14 at 12:27
  • Accepted. I dint get using `xpath_expression_string` and why it is needed before starting the next steps? How it will validate? – shree11 Jul 17 '14 at 12:44
1

then follow the below steps to get the solution as like as you want, just change the source data this

 <catalog><book><id>11</id><genre>Computer</genre><price>44</price></book></catalog>
<catalog><book><id>44</id><genre>Fantasy</genre><price>5</price></book></catalog> 

now try below steps:

select xpath(xmldata, '/catalog/book/id/text()')as id,
xpath(xmldata, '/catalog/book/genre/text()')as genre,
xpath(xmldata, '/catalog/book/price/text()')as price FROM xmltable;

now you will get ans as like this:

["11"] ["Computer"] ["44"]

["44"] ["Fantasy"] ["5"]

if you apply xapth_string, xpath_int, xpath_int udfs the you will get ans like

11 computer 44

44 Fantasy 5.

Thanks

0

First try to load file my add file path-to-file, that will solve your problem as It is solved in my case

Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121
0

Oracle XML Extensions for Hive can be used to create Hive tables over XML like this. https://docs.oracle.com/cd/E54130_01/doc.26/e54142/oxh_hive.htm#BDCUG691

Josh
  • 413
  • 2
  • 7
0

Also ensure that the XML file doesn't contain any empty spaces at the end of the last closing tag. In my case, the source file had one, and whenever I loaded the file into hive, my resulting table contained NULLS in them. So whenever I applied an xpath function, the result would have a few of these [] [] [] [] [] []

Although the xpath_string function worked, the xpath_double and xpath_int functions never did. It kept throwing this exception -

Diagnostic Messages for this Task:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"line":""}
Animesh
  • 176
  • 1
  • 1
  • 10