3

I am trying to load deeply nested JSON data into hive tables. Let me tell you guys what I tried so far.

1- I have JSON files and they are deeply nested like array of structs which have struct fields again.

2- I successfully loaded this json data into Spark Data frame and able to see the schema. Further I successful store this dataframe as a hive table using below command from spark shell.

new org.apache.spark.sql.hive.HiveContext(sc).read.json(
"/user/alpha/test.json").saveAsTable("mywarehouse.patent_data_2001");

But when I try to any query like select * from patent_data_2001 limit 1

It gives me below error

FAILED: IllegalArgumentException Error: type expected at the position 4339 of 'array<struct<id:string>>:struct<claim:array<struc
.
.
.
.
.
,wi:string>,nb_file:string>>:struct<id:string,sequence_list:struct<carriers:string,file:string,seq_file_type:string>>' but 'stru
c' is found.

3- I tried to use Hive serde instead of Spark SQL spllied jars using below command from spark shell

 hc.sql("SET spark.sql.hive.convertMetastoreParquet=false")

Still the same error.

It creates the table in hive warehouse and load the data but it gives me error when I try to query the table or even describe the table.

4- Assuming this issue might be related to Spark SQL and Hive compatibility. I thought of fixing the schema since the table was created and only table schema was faulty. I tried to fix the schema manually and yeah it was a long and really time taking process and finally fixed the schema. I created the hive tables again manually using Create table statement as show below.

CREATE TABLE `patent_data_2001`(
  `abstract` array<struct<id:string>> COMMENT '', 
  `claims` struct<claim:array<struct<id:string,num:string>>,id:string> COMMENT '', 
  `country` string COMMENT '', 
  `date_produced` string COMMENT '', 
  `date_publ` string COMMENT '', 
  `description` string COMMENT '', 
  `drawings`      struct<figure:array<struct<id:string,img:struct<alt:string,file:string,he:string,id:string,img_content:string,img_format:string,orientation:string,wi:string>,num:string>>,id:string> COMMENT '', 
  `dtd_version` string COMMENT '', 
  `file` string COMMENT '', 
  `id` string COMMENT '', 
  `lang` string COMMENT '', 
  `status` string COMMENT '', 
  `table_external_doc` array<string> COMMENT '', 
  `us_bibliographic_data_grant` struct<application_reference:struct<appl_type:string,document_id:struct<country:string,`date`:string,doc_number:string>>,assignees:struct<assignee:array<struct<addressbook:struct<address:struct<city:string,country:string,state:string>,first_name:string,last_name:string,orgname:string,role:string>,first_name:string,last_name:string,orgname:string,role:string>>>,classification_locarno:struct<edition:string,main_classification:string>,classification_national:array<struct<country:string,main_classification:string>>,classifications_cpc:struct<further_cpc:struct<classification_cpc:array<struct<action_date:struct<`date`:string>,classification_data_source:string,classification_status:string,classification_value:string,cpc_version_indicator:struct<`date`:string>,generating_office:struct<country:string>,main_group:string,scheme_origination_code:string,section:string,subclass:string,subgroup:string,symbol_position:string>>,combination_set:array<struct<combination_rank:array<struct<classification_cpc:struct<action_date:struct<`date`:string>,classification_data_source:string,classification_status:string,classification_value:string,cpc_version_indicator:struct<`date`:string>,generating_office:struct<country:string>,main_group:string,scheme_origination_code:string,section:string,subclass:string,subgroup:string,symbol_position:string>,rank_number:string>>,group_number:string>>>,main_cpc:struct<classification_cpc:struct<action_date:struct<`date`:string>,classification_data_source:string,classification_status:string,classification_value:string,cpc_version_indicator:struct<`date`:string>,generating_office:struct<country:string>,main_group:string,scheme_origination_code:string,section:string,subclass:string,subgroup:string,symbol_position:string>>>,classifications_ipcr:struct<classification_ipcr:array<struct<action_date:struct<`date`:string>,classification_data_source:string,classification_level:string,classification_status:string,classification_value:string,generating_office:struct<country:string>,ipc_version_indicator:struct<`date`:string>,main_group:string,section:string,subclass:string,subgroup:string,symbol_position:string>>>,examiners:struct<assistant_examiner:struct<first_name:string,last_name:string>,primary_examiner:struct<department:string,first_name:string,last_name:string>>,invention_title:string,number_of_claims:string,pct_or_regional_filing_data:struct<document_id:struct<country:string,`date`:string,doc_number:string,kind:string>,us_371c124_date:struct<`date`:string>,us_371c12_date:struct<`date`:string>>,pct_or_regional_publishing_data:struct<document_id:struct<country:string,`date`:string,doc_number:string,kind:string>>,priority_claims:struct<priority_claim:array<struct<country:string,`date`:string,doc_number:string,kind:string,sequence:string>>>,publication_reference:struct<document_id:struct<country:string,`date`:string,doc_number:string,kind:string>>,rule_47_flag:string,us_application_series_code:string,us_botanic:struct<latin_name:string,variety:string>,us_field_of_classification_search:struct<classification_national:array<struct<country:string,main_classification:string>>>,us_parties:struct<agents:struct<agent:array<struct<addressbook:array<struct<address:struct<country:string>,first_name:string,last_name:string,orgname:string>>,rep_type:string,sequence:string>>>,inventors:struct<inventor:array<struct<addressbook:array<struct<address:struct<city:string,country:string,state:string>,first_name:string,last_name:string>>,designation:string,sequence:string>>>,us_applicants:struct<us_applicant:array<struct<addressbook:array<struct<address:struct<city:string,country:string,state:string>,first_name:string,last_name:string,orgname:string>>,app_type:string,applicant_authority_category:string,designation:string,sequence:string>>>>,us_references_cited:struct<us_citation:array<struct<classification_national:array<struct<country:string,main_classification:string>>>>>,us_related_documents: struct < continuation: array< struct< relation: struct< child_doc: struct< document_id: struct < country: string,   `date`: string, doc_number: string > >, parent_doc: struct< document_id: struct< country: string,`date`:string, doc_number: string>, parent_grant_document: struct< document_id: struct< country: string,`date`:string, doc_number: string>>, parent_pct_document: struct< document_id: struct< country: string,`date`:string, doc_number: string>>, parent_status: string>>>>, continuation_in_part: array< struct < relation: struct< child_doc: struct< document_id: struct< country: string,`date`:string,doc_number: string>>, parent_doc: struct< document_id: struct< country: string,`date`:string,doc_number: string>, parent_grant_document: struct< document_id: struct< country: string,`date`:string,doc_number: string >>, parent_pct_document: struct< document_id: struct < country: string,`date`:string, doc_number: string>>, parent_status: string >>>>, division: array< struct< relation: struct< child_doc: struct< document_id: struct< country: string,`date`:string, doc_number: string>>, parent_doc: struct< document_id: struct< country: string,`date`:string, doc_number: string >, parent_grant_document: struct< document_id: struct< country: string,`date`:string, doc_number: string >>, parent_pct_document: struct< document_id: struct< country: string,`date`:string, doc_number: string>>, parent_status: string >>>>, reissue: array< struct< relation: struct < child_doc: struct < document_id: struct < country: string,`date`:string, doc_number: string>>, parent_doc: struct< document_id: struct < country: string,`date`:string,doc_number: string > ,parent_grant_document: struct < document_id: struct < country: string,`date`:string, doc_number: string >>, parent_pct_document: struct< document_id: struct < country: string,`date`:string, doc_number: string >>, parent_status: string >>>>, related_publication: array< struct < document_id: array < struct< country: string,`date`:string,doc_number: string, kind: string>>>>, substitution: array< struct< relation: struct < child_doc: struct < document_id: struct < country: string,`date`:string, doc_number: string >>, parent_doc: struct< document_id: struct < country: string,`date`:string, doc_number: string >, parent_status: string >>>>, us_provisional_application: array< struct < document_id: struct < country: string,`date`:string, doc_number: string >>>, us_term_of_grant: struct < disclaimer: array < struct< text: string>>>>>, 
  `us_chemistry` array<struct<cdx_file:string,idref:string,mol_file:string>> COMMENT '', 
  `us_claim_statement` string COMMENT '', 
  `us_math` array<struct<idrefs:string,img:struct<alt:string,file:string,he:string,id:string,img_content:string,img_format:string,wi:string>,nb_file:string>> COMMENT '', 
  `us_sequence_list_doc` struct<id:string,sequence_list:struct<carriers:string,file:string,seq_file_type:string>> COMMENT '')
  ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
WITH SERDEPROPERTIES ( 
  'path'='hdfs://cluster-A-XYZ:8020/user/hive/warehouse/mywarehouse/patent_data_2001') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://cluster-A-XYZ:8020/user/hive/warehouse/mywarehouse/patent_data_2001';

BUT again I got the same error.

5- I tried to use Serde's as listed below in below links

https://github.com/rcongiu/Hive-JSON-Serde
https://github.com/proofpoint/hive-serde
http://www.congiu.net/hive-json-serde/1.3.6/

but no luck at all.

6- While researching I found that it may be a problem mentioned in these JIRA

**ArrayIndexOutOfBounds exception for deeply nested structs**
https://issues.apache.org/jira/browse/HIVE-3253

**Support nested structs over 24 levels.**
https://issues.apache.org/jira/browse/HIVE-9500

and I ran table creation with below serde

ROW FORMAT SERDE   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'hive.serialization.extend.nesting.levels'='true' )

But I am getting the same error with no other details.

I can't figure out why these errors are happening and why hive is not letting me query the table when it is letting me create it successfully.

Any help or suggestion will be really helpful. Please help.

Many thanks.

dtolnay
  • 9,621
  • 5
  • 41
  • 62
Hadoop-worker
  • 196
  • 11
  • I am also getting the same error, my query was working in Hive 0.12, but I am getting this now in Hive 1.2. – Mukesh S Nov 10 '16 at 13:44
  • @Mukesh S: Hi I am also using Hive 1.2 and gettting this error. I couldn't find a solution as my xml line was more that 2.1 Gb. Luckily I have that line a file therefore just removed it. I couldn't find a setting in hive which was useful. Please let me know if you find any thing. Thank you. – Hadoop-worker Nov 10 '16 at 15:03
  • hello , I also have the same issue . can you please give me workaround for this ? – Avinash Nishanth S Sep 11 '17 at 06:22

0 Answers0