I am trying to load deeply nested JSON data into hive tables. Let me tell you guys what I tried so far.
1- I have JSON files and they are deeply nested like array of structs which have struct fields again.
2- I successfully loaded this json data into Spark Data frame and able to see the schema. Further I successful store this dataframe as a hive table using below command from spark shell.
new org.apache.spark.sql.hive.HiveContext(sc).read.json(
"/user/alpha/test.json").saveAsTable("mywarehouse.patent_data_2001");
But when I try to any query like select * from patent_data_2001 limit 1
It gives me below error
FAILED: IllegalArgumentException Error: type expected at the position 4339 of 'array<struct<id:string>>:struct<claim:array<struc
.
.
.
.
.
,wi:string>,nb_file:string>>:struct<id:string,sequence_list:struct<carriers:string,file:string,seq_file_type:string>>' but 'stru
c' is found.
3- I tried to use Hive serde instead of Spark SQL spllied jars using below command from spark shell
hc.sql("SET spark.sql.hive.convertMetastoreParquet=false")
Still the same error.
It creates the table in hive warehouse and load the data but it gives me error when I try to query the table or even describe the table.
4- Assuming this issue might be related to Spark SQL and Hive compatibility. I thought of fixing the schema since the table was created and only table schema was faulty. I tried to fix the schema manually and yeah it was a long and really time taking process and finally fixed the schema. I created the hive tables again manually using Create table statement
as show below.
CREATE TABLE `patent_data_2001`(
`abstract` array<struct<id:string>> COMMENT '',
`claims` struct<claim:array<struct<id:string,num:string>>,id:string> COMMENT '',
`country` string COMMENT '',
`date_produced` string COMMENT '',
`date_publ` string COMMENT '',
`description` string COMMENT '',
`drawings` struct<figure:array<struct<id:string,img:struct<alt:string,file:string,he:string,id:string,img_content:string,img_format:string,orientation:string,wi:string>,num:string>>,id:string> COMMENT '',
`dtd_version` string COMMENT '',
`file` string COMMENT '',
`id` string COMMENT '',
`lang` string COMMENT '',
`status` string COMMENT '',
`table_external_doc` array<string> COMMENT '',
`us_bibliographic_data_grant` struct<application_reference:struct<appl_type:string,document_id:struct<country:string,`date`:string,doc_number:string>>,assignees:struct<assignee:array<struct<addressbook:struct<address:struct<city:string,country:string,state:string>,first_name:string,last_name:string,orgname:string,role:string>,first_name:string,last_name:string,orgname:string,role:string>>>,classification_locarno:struct<edition:string,main_classification:string>,classification_national:array<struct<country:string,main_classification:string>>,classifications_cpc:struct<further_cpc:struct<classification_cpc:array<struct<action_date:struct<`date`:string>,classification_data_source:string,classification_status:string,classification_value:string,cpc_version_indicator:struct<`date`:string>,generating_office:struct<country:string>,main_group:string,scheme_origination_code:string,section:string,subclass:string,subgroup:string,symbol_position:string>>,combination_set:array<struct<combination_rank:array<struct<classification_cpc:struct<action_date:struct<`date`:string>,classification_data_source:string,classification_status:string,classification_value:string,cpc_version_indicator:struct<`date`:string>,generating_office:struct<country:string>,main_group:string,scheme_origination_code:string,section:string,subclass:string,subgroup:string,symbol_position:string>,rank_number:string>>,group_number:string>>>,main_cpc:struct<classification_cpc:struct<action_date:struct<`date`:string>,classification_data_source:string,classification_status:string,classification_value:string,cpc_version_indicator:struct<`date`:string>,generating_office:struct<country:string>,main_group:string,scheme_origination_code:string,section:string,subclass:string,subgroup:string,symbol_position:string>>>,classifications_ipcr:struct<classification_ipcr:array<struct<action_date:struct<`date`:string>,classification_data_source:string,classification_level:string,classification_status:string,classification_value:string,generating_office:struct<country:string>,ipc_version_indicator:struct<`date`:string>,main_group:string,section:string,subclass:string,subgroup:string,symbol_position:string>>>,examiners:struct<assistant_examiner:struct<first_name:string,last_name:string>,primary_examiner:struct<department:string,first_name:string,last_name:string>>,invention_title:string,number_of_claims:string,pct_or_regional_filing_data:struct<document_id:struct<country:string,`date`:string,doc_number:string,kind:string>,us_371c124_date:struct<`date`:string>,us_371c12_date:struct<`date`:string>>,pct_or_regional_publishing_data:struct<document_id:struct<country:string,`date`:string,doc_number:string,kind:string>>,priority_claims:struct<priority_claim:array<struct<country:string,`date`:string,doc_number:string,kind:string,sequence:string>>>,publication_reference:struct<document_id:struct<country:string,`date`:string,doc_number:string,kind:string>>,rule_47_flag:string,us_application_series_code:string,us_botanic:struct<latin_name:string,variety:string>,us_field_of_classification_search:struct<classification_national:array<struct<country:string,main_classification:string>>>,us_parties:struct<agents:struct<agent:array<struct<addressbook:array<struct<address:struct<country:string>,first_name:string,last_name:string,orgname:string>>,rep_type:string,sequence:string>>>,inventors:struct<inventor:array<struct<addressbook:array<struct<address:struct<city:string,country:string,state:string>,first_name:string,last_name:string>>,designation:string,sequence:string>>>,us_applicants:struct<us_applicant:array<struct<addressbook:array<struct<address:struct<city:string,country:string,state:string>,first_name:string,last_name:string,orgname:string>>,app_type:string,applicant_authority_category:string,designation:string,sequence:string>>>>,us_references_cited:struct<us_citation:array<struct<classification_national:array<struct<country:string,main_classification:string>>>>>,us_related_documents: struct < continuation: array< struct< relation: struct< child_doc: struct< document_id: struct < country: string, `date`: string, doc_number: string > >, parent_doc: struct< document_id: struct< country: string,`date`:string, doc_number: string>, parent_grant_document: struct< document_id: struct< country: string,`date`:string, doc_number: string>>, parent_pct_document: struct< document_id: struct< country: string,`date`:string, doc_number: string>>, parent_status: string>>>>, continuation_in_part: array< struct < relation: struct< child_doc: struct< document_id: struct< country: string,`date`:string,doc_number: string>>, parent_doc: struct< document_id: struct< country: string,`date`:string,doc_number: string>, parent_grant_document: struct< document_id: struct< country: string,`date`:string,doc_number: string >>, parent_pct_document: struct< document_id: struct < country: string,`date`:string, doc_number: string>>, parent_status: string >>>>, division: array< struct< relation: struct< child_doc: struct< document_id: struct< country: string,`date`:string, doc_number: string>>, parent_doc: struct< document_id: struct< country: string,`date`:string, doc_number: string >, parent_grant_document: struct< document_id: struct< country: string,`date`:string, doc_number: string >>, parent_pct_document: struct< document_id: struct< country: string,`date`:string, doc_number: string>>, parent_status: string >>>>, reissue: array< struct< relation: struct < child_doc: struct < document_id: struct < country: string,`date`:string, doc_number: string>>, parent_doc: struct< document_id: struct < country: string,`date`:string,doc_number: string > ,parent_grant_document: struct < document_id: struct < country: string,`date`:string, doc_number: string >>, parent_pct_document: struct< document_id: struct < country: string,`date`:string, doc_number: string >>, parent_status: string >>>>, related_publication: array< struct < document_id: array < struct< country: string,`date`:string,doc_number: string, kind: string>>>>, substitution: array< struct< relation: struct < child_doc: struct < document_id: struct < country: string,`date`:string, doc_number: string >>, parent_doc: struct< document_id: struct < country: string,`date`:string, doc_number: string >, parent_status: string >>>>, us_provisional_application: array< struct < document_id: struct < country: string,`date`:string, doc_number: string >>>, us_term_of_grant: struct < disclaimer: array < struct< text: string>>>>>,
`us_chemistry` array<struct<cdx_file:string,idref:string,mol_file:string>> COMMENT '',
`us_claim_statement` string COMMENT '',
`us_math` array<struct<idrefs:string,img:struct<alt:string,file:string,he:string,id:string,img_content:string,img_format:string,wi:string>,nb_file:string>> COMMENT '',
`us_sequence_list_doc` struct<id:string,sequence_list:struct<carriers:string,file:string,seq_file_type:string>> COMMENT '')
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'path'='hdfs://cluster-A-XYZ:8020/user/hive/warehouse/mywarehouse/patent_data_2001')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://cluster-A-XYZ:8020/user/hive/warehouse/mywarehouse/patent_data_2001';
BUT again I got the same error.
5- I tried to use Serde's as listed below in below links
https://github.com/rcongiu/Hive-JSON-Serde
https://github.com/proofpoint/hive-serde
http://www.congiu.net/hive-json-serde/1.3.6/
but no luck at all.
6- While researching I found that it may be a problem mentioned in these JIRA
**ArrayIndexOutOfBounds exception for deeply nested structs**
https://issues.apache.org/jira/browse/HIVE-3253
**Support nested structs over 24 levels.**
https://issues.apache.org/jira/browse/HIVE-9500
and I ran table creation with below serde
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'hive.serialization.extend.nesting.levels'='true' )
But I am getting the same error with no other details.
I can't figure out why these errors are happening and why hive is not letting me query the table when it is letting me create it successfully.
Any help or suggestion will be really helpful. Please help.
Many thanks.