
I am looking for a way, through Hive, to take the following AVSC file content and externalize the nested "RENTALRECORDTYPE" schema for the purposes of schema reuse.

{
    "type": "record",
    "name": "EMPLOYEE",
    "namespace": "",
    "doc": "EMPLOYEE is a person that works here",
    "fields": [
        {
            "name": "RENTALRECORD",
            "type": {
                "type": "record",
                "name": "RENTALRECORDTYPE",
                "namespace": "",
                "doc": "Rental record is a record that is kept on every item rented",
                "fields": [
                    {
                        "name": "due_date",
                        "doc": "The date when item is due",
                        "type": "int"
                    } 
                ]
            }
        },
        {
            "name": "hire_date",
            "doc": "Employee date of hire",
            "type": "int"
        }
    ]
}

This method of defining the schema works fine. I am able to issue the following HiveQL statement and the table is created successfully.

CREATE EXTERNAL TABLE employee
STORED AS AVRO
LOCATION '/user/dtom/store/data/employee'
TBLPROPERTIES ('avro.schema.url'='/user/dtom/store/schema/employee.avsc');

However, I want to be able to reference an existing schema instead of duplicating the record definition across multiple schemas. For example, instead of a single schema file, two AVSC files would be generated: rentalrecord.avsc and employee.avsc.

rentalrecord.avsc

{
    "type": "record",
    "name": "RENTALRECORD",
    "namespace": "",
    "doc": "A record that is kept for every rental",
    "fields": [
        {
            "name": "due_date",
            "doc": "The date on which the rental is due back to the store",
            "type": "int"
        }
    ]
}

employee.avsc

{
    "type": "record",
    "name": "EMPLOYEE",
    "namespace": "",
    "doc": "EMPLOYEE is a person that works for the VIDEO STORE",
    "fields": [
        {
            "name": "rentalrecord",
            "doc": "A rental record is a record on every rental",
            "type": "RENTALRECORD"
        },
        {
            "name": "hire_date",
            "doc": "Employee date of hire",
            "type": "int"
        }
    ]
}

In the above scenario, we want to be able to externalize the RENTALRECORD schema definition and be able to reuse it in employee.avsc and elsewhere.

When attempting to import the schema using the following two HiveQL statements, it fails…

CREATE EXTERNAL TABLE rentalrecord
STORED AS AVRO
LOCATION '/user/dtom/store/data/rentalrecord'
TBLPROPERTIES ('avro.schema.url'='/user/dtom/store/schema/rentalrecord.avsc');

CREATE EXTERNAL TABLE employee
STORED AS AVRO
LOCATION '/user/dtom/store/data/employee'
TBLPROPERTIES ('avro.schema.url'='/user/dtom/store/schema/employee.avsc');

The rentalrecord.avsc is imported successfully, but employee.avsc fails on the first field definition, the field of type "RENTALRECORD". Hive outputs the following error:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException Encountered exception determining schema. Returning signal schema to indicate problem: "RENTALRECORD" is not a defined name. The type of the "rentalrecord" field must be a defined name or a {"type": ...} expression.)

My research tells me that the Avro specification does support this form of schema reuse. So either I'm missing something, or this is not supported through Hive.
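For what it's worth, since Hive resolves avro.schema.url as a single self-contained file, one mechanical workaround is to inline the external record into the referencing schema before uploading it. The sketch below (inline_schema is a hypothetical helper of my own, not a Hive or Avro API, and the schemas are simplified versions of the ones above) shows the idea:

```python
import json

# Hypothetical helper (not part of Hive or Avro): replace a bare named-type
# reference in a schema with the full record definition, yielding the single
# self-contained AVSC that Hive's avro.schema.url expects.
def inline_schema(outer, named):
    name = named["name"]

    def substitute(node):
        if isinstance(node, dict):
            # A {"type": "RENTALRECORD"} reference becomes the full definition.
            return {k: (named if k == "type" and v == name else substitute(v))
                    for k, v in node.items()}
        if isinstance(node, list):
            return [substitute(item) for item in node]
        return node

    return substitute(outer)

# Simplified versions of the two schemas from the question.
rentalrecord = {
    "type": "record",
    "name": "RENTALRECORD",
    "fields": [{"name": "due_date", "type": "int"}],
}
employee = {
    "type": "record",
    "name": "EMPLOYEE",
    "fields": [
        {"name": "rentalrecord", "type": "RENTALRECORD"},
        {"name": "hire_date", "type": "int"},
    ],
}

merged = inline_schema(employee, rentalrecord)
print(json.dumps(merged, indent=2))
```

The merged output is the same shape as the first, working schema: EMPLOYEE with RENTALRECORD's full definition embedded in the field's "type".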

Any help would be greatly appreciated.

dtom
  • The property only points at one url... You need to embed all record typed objects in a single avsc – OneCricketeer Dec 16 '17 at 01:17
  • @cricket_007 thanks for the response. Hive appears to only handle one schema definition within an AVSC file. So consolidating the two avsc files into one results in (assuming I put the dependent record definition on top) a single Avro-backed table with the fields defined in the first record... the subsequent record definition is ignored. – dtom Dec 18 '17 at 14:20
  • The definitions must be nested, not one defined on top of the other – OneCricketeer Dec 18 '17 at 14:28
  • @cricket_007 so essentially the way I had it to begin with in the first snippet of code. That I know works. What I am trying to determine is whether the method I want to move towards, to externalize the embedded definition so it can be reused as a "type" works. – dtom Dec 18 '17 at 14:37
  • I understand what you want, but your best option is to define AVDL, for example, which supports import statements, then use the avro tools jar file to generate the avsc and upload those to your cluster and build tables – OneCricketeer Dec 18 '17 at 14:44
  • OK, got it... I just did a quick look at how AVDL files are formatted and at the AVSC files generated from them. It looks like they just embed the definitions I want to externalize anyway. Regardless, the AVSC needs to have the definitions expanded. This makes your initial replies a lot clearer. That's disappointing. LOL. Thanks for the help. – dtom Dec 18 '17 at 15:08

1 Answer


I defined an AVDL file containing all the record references and then used the avro-tools jar with the idl2schemata option to generate the AVSC files. The generated AVSC worked like a charm with Hive!
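For illustration, an AVDL along these lines (the protocol name and file names here are my assumptions, not from the answer) could define both records in one place:

```
// store.avdl
protocol Store {
  record RENTALRECORD {
    /** The date on which the rental is due back to the store */
    int due_date;
  }

  record EMPLOYEE {
    /** A rental record kept for every rental */
    RENTALRECORD rentalrecord;
    /** Employee date of hire */
    int hire_date;
  }
}
```

Running `java -jar avro-tools.jar idl2schemata store.avdl out/` then writes one .avsc file per named record, and the generated EMPLOYEE.avsc has the RENTALRECORD definition inlined, which is the self-contained form Hive needs for avro.schema.url.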

UNayak