
I'm trying to create a Hive external table on parquet data files dynamically, i.e. without listing column names and types in the Hive DDL. I have the Avro schema of the underlying parquet file.

My attempt uses the DDL below:

CREATE EXTERNAL TABLE parquet_test
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS PARQUET
LOCATION 'hdfs://myParquetFilesPath'
TBLPROPERTIES ('avro.schema.url'='http://myHost/myAvroSchema.avsc');

My Hive table is successfully created with the right schema, but when I try to read the data:

SELECT * FROM parquet_test;

I get the following error:

java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Expecting a AvroGenericRecordWritable

Is there a way to successfully create and read Parquet files without listing column names and types in the DDL?

  • If the underlying data is in parquet format, I believe it is not possible to read it using an avro schema. It is more like reading a French book with an English dictionary in hand. – Ram Manohar Dec 09 '15 at 14:57
  • I'm only using AvroSerDe for schema inference... – tmouron Dec 09 '15 at 15:20
  • The above error (Expecting a AvroGenericRecordWritable) indicates you are trying to read a parquet-format record with AvroSerde. ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' tells Hive to use AvroSerde to decode the data, but the data is not in avro format. – Ram Manohar Dec 09 '15 at 16:08
  • Yep, I'm trying to find an alternative solution; maybe I should generate a Hive query from an Avro schema. But I'm surprised that there isn't a simpler solution. – tmouron Dec 09 '15 at 16:27
  • I haven't tested but try this CREATE TABLE avro_test ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS AVRO TBLPROPERTIES ('avro.schema.url'='http://myHost/myAvroSchema.avsc'); CREATE EXTERNAL TABLE parquet_test LIKE avro_test STORED AS PARQUET LOCATION 'hdfs://myParquetFilesPath'; – Ram Manohar Dec 09 '15 at 17:38
  • You have to use avro tools to convert the avro schema into a hive schema. Hive abstracts the parquet details. See https://developer.ibm.com/hadoop/blog/2015/11/10/use-parquet-tools-avro-tools-iop-4-1/ for avro tools. – Roberto Congiu Dec 10 '15 at 04:43
  • Ram, your idea is great! I just tried it and now I get: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable – tmouron Dec 10 '15 at 10:56
  • Roberto, thanks a lot for your pointer. I'm trying to find out how I can convert an Avro schema into a Hive schema. Sounds like I can't do it directly, I should write some code for it, right? – tmouron Dec 10 '15 at 10:59
  • I have tested it and it works! From the error message I think your avro schema file is messed up. Try using parquet tools to regenerate the avro schema from the parquet file. – Ram Manohar Dec 10 '15 at 20:24
  • Possible duplicate of [Create Hive table to read parquet files from parquet/avro schema](https://stackoverflow.com/questions/34202743/create-hive-table-to-read-parquet-files-from-parquet-avro-schema) – eliasah Aug 13 '19 at 09:53
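The comments above suggest an alternative route: generate the Hive DDL from the Avro schema yourself. A minimal stdlib-only Python sketch of that idea follows; it handles only Avro primitive types and simple nullable unions (complex and nested types would need more work), and the table name and location are the ones from the question:

```python
import json

# Rough mapping from Avro primitive types to Hive column types
# (sketch only: logical types, records, arrays, and maps are not handled).
AVRO_TO_HIVE = {
    "string": "STRING", "int": "INT", "long": "BIGINT",
    "float": "FLOAT", "double": "DOUBLE", "boolean": "BOOLEAN",
    "bytes": "BINARY",
}

def hive_type(avro_type):
    # Nullable Avro fields are unions like ["null", "int"]; Hive columns are
    # nullable by default, so just strip the "null" branch.
    if isinstance(avro_type, list):
        branches = [t for t in avro_type if t != "null"]
        if len(branches) == 1:
            return hive_type(branches[0])
        raise ValueError("unions with multiple non-null branches not handled")
    if avro_type in AVRO_TO_HIVE:
        return AVRO_TO_HIVE[avro_type]
    raise ValueError("complex type %r not handled in this sketch" % (avro_type,))

def ddl_from_avro(schema_json, table_name, location):
    """Build a CREATE EXTERNAL TABLE statement from an Avro record schema."""
    schema = json.loads(schema_json)
    cols = ",\n  ".join(
        "`%s` %s" % (f["name"], hive_type(f["type"])) for f in schema["fields"]
    )
    return (
        "CREATE EXTERNAL TABLE %s (\n  %s\n)\n"
        "STORED AS PARQUET\nLOCATION '%s';" % (table_name, cols, location)
    )
```

Fed the `.avsc` file from the question, this would print a plain Parquet DDL with explicit columns, sidestepping the AvroSerDe entirely.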

1 Answer


The query below works:

CREATE TABLE avro_test
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='http://myHost/myAvroSchema.avsc');

CREATE EXTERNAL TABLE parquet_test
LIKE avro_test
STORED AS PARQUET
LOCATION 'hdfs://myParquetFilesPath';
  • I'm running Hive 1.1.0 and that doesn't work for me. I get `java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable` – tmouron Dec 11 '15 at 16:18
  • The error message indicates that a particular column is in text format in the parquet file while your avro schema defines it as int, hence the error. Change the avro schema of that particular column to string and retry. – Ram Manohar Dec 11 '15 at 17:41
  • I think my problem came from different and incompatible schemas. It's working fine now. Thank you ! – tmouron Dec 14 '15 at 14:15
  • Another solution is to create my own SerDe : what I **only** need from `AvroSerDe` is schema inference from an avro schema. So, I have to create a class that extends `org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe` and override method `public boolean shouldStoreFieldsInMetastore(Map tableParams)` using an `org.apache.hadoop.hive.serde2.avro.AvroSerde` object. This seems easy to write with Hive 1.2.0. – tmouron Dec 14 '15 at 14:21
  • I am getting FAILED: ParseException line 1:55 missing EOF at 'STORED' near avro_test on Hive 0.14. Runs well on Hive 1.2. – pratim_b Nov 03 '16 at 06:46
  • do you know how to add a column using the avro.schema.literal to a table created in this fashion? – Alex Naspo Nov 17 '16 at 16:21
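The ClassCastException discussed in the comments above (Text cannot be cast to IntWritable) came down to the Avro schema declaring a column as int while the data was actually a string. A quick way to spot such mismatches is to diff two Avro record schemas field by field, e.g. the `.avsc` you supplied versus one regenerated from the Parquet file with parquet tools. A small stdlib-only sketch (the schemas in the test are hypothetical):

```python
import json

def diff_avro_fields(schema_a_json, schema_b_json):
    """Report fields whose declared types differ between two Avro record
    schemas -- a common cause of Text/IntWritable ClassCastExceptions when
    the table schema disagrees with what was actually written."""
    a = {f["name"]: f["type"] for f in json.loads(schema_a_json)["fields"]}
    b = {f["name"]: f["type"] for f in json.loads(schema_b_json)["fields"]}
    # A field missing from one side shows up with None for that side.
    return [
        (name, a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    ]
```

Running this on the table's schema and the data's schema would immediately surface the offending column, instead of decoding the runtime ClassCastException by hand.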