
I am using a Spark SQL Dataset to write data into Hive. It works perfectly if the schema stays the same, but if I change the Avro schema by adding a new column in the middle, it shows the error below (the schema is provided from a schema registry):

Error running job streaming job 1519289340000 ms.0 org.apache.spark.sql.AnalysisException: The column number of the existing table default.sample(struct<collection_timestamp:bigint,managed_object_id:string,managed_object_type:string,if_admin_status:string,date:string,hour:int,quarter:bigint>) doesn't match the data schema(struct<collection_timestamp:bigint,managed_object_id:string,if_oper_status:string,managed_object_type:string,if_admin_status:string,date:string,hour:int,quarter:bigint>);

if_oper_status is the new column that has to be added. Please suggest a fix.

// Build the DataFrame using the latest Avro schema from the registry
StructType struct = convertSchemaToStructType(SchemaRegstryClient.getLatestSchema("simple"));
Dataset<Row> dataset = getSparkInstance().createDataFrame(newRDD, struct);

// Derive the partition columns from the current date/time
dataset = dataset.withColumn("date", functions.date_format(functions.current_date(), "dd-MM-yyyy"));
dataset = dataset.withColumn("hour", functions.hour(functions.current_timestamp()));
dataset = dataset.withColumn("quarter", functions.floor(functions.minute(functions.current_timestamp()).divide(5)));

// Append to the Hive table, partitioned by date/hour/quarter
dataset
    .coalesce(1)
    .write().mode(SaveMode.Append)
    .option("charset", "UTF8")
    .partitionBy("date", "hour", "quarter")
    .option("checkpointLocation", "/tmp/checkpoint")
    .saveAsTable("sample");

2 Answers


I was able to solve this problem by saving the schema from the registry into a file and providing that file path as avro.schema.url, as below.

Note: This has to be done before saveAsTable("sample")

dataset.sqlContext().sql(
    "CREATE EXTERNAL TABLE IF NOT EXISTS sample "
    + "PARTITIONED BY (dt STRING, hour STRING, quarter STRING) "
    + "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' "
    + "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' "
    + "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' "
    + "LOCATION 'hdfs://localhost:9000/user/root/sample' "
    + "TBLPROPERTIES ('avro.schema.url'='file://" + file.getAbsolutePath() + "')");
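
For completeness, here is a minimal sketch of the "save the schema into a file" step that produces the file referenced above. It assumes SchemaRegstryClient.getLatestSchema returns an org.apache.avro.Schema (which the question's convertSchemaToStructType call suggests); adapt it to whatever type your registry client actually returns:

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import org.apache.avro.Schema;

// Fetch the latest schema and dump it as JSON, the format the AvroSerDe
// reads from avro.schema.url (assumption: getLatestSchema returns
// org.apache.avro.Schema)
Schema avroSchema = SchemaRegstryClient.getLatestSchema("simple");
File file = File.createTempFile("simple-schema", ".avsc");
// toString(true) pretty-prints the schema as JSON
Files.write(file.toPath(), avroSchema.toString(true).getBytes(StandardCharsets.UTF_8));

Since avro.schema.url points at this file, re-running the job after a schema change refreshes the table's schema without the column-count check that saveAsTable performs.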

Please refer to this link: https://github.com/databricks/spark-avro/pull/155. Per the commit history, the PR to support evolving Avro schemas was added in release 3.1. What version of spark-avro are you using in your code?
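
If you are on an older release, upgrading the dependency should pick up that change. As a hedged example, the Maven coordinates might look like the following (the _2.11 suffix is an assumption and must match the Scala version of your Spark build):

<!-- spark-avro 3.2.0 postdates the schema-evolution PR linked above -->
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>3.2.0</version>
</dependency>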