I am reading and writing events from EventHub in spark after trying to aggregated based on few keys like this:
val df1 = df0
.groupBy(
colKey,
colTimestamp
)
.agg(
collect_list(
struct(
colCreationTimestamp,
colRecordId
)
).as("Records")
)
But i am getting this error at runtime:
Error
Caused by: org.apache.spark.sql.execution.streaming.state.StateSchemaNotCompatible: Provided schema doesn't match to the schema for existing state! Please note that Spark allow difference of field name: check count of fields and data type of each field.
- Provided key schema: StructType(StructField(Key,StringType,true), StructField(Timestamp,TimestampType,true)
- Provided value schema: StructType(StructField(buf,BinaryType,true))
- Existing key schema: StructType(StructField(_1,StringType,true), StructField(_2,TimestampType,true))
- Existing value schema: StructType(StructField(buf,BinaryType,true))
If you want to force running query without schema validation, please set spark.sql.streaming.stateStore.stateSchemaCheck to false.
Please note running query with incompatible schema could cause indeterministic behavior.
at org.apache.spark.sql.execution.streaming.state.StateSchemaCompatibilityChecker.check(StateSchemaCompatibilityChecker.scala:60)
at org.apache.spark.sql.execution.streaming.state.StateStore$.$anonfun$getStateStoreProvider$2(StateStore.scala:487)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.sql.execution.streaming.state.StateStore$.$anonfun$getStateStoreProvider$1(StateStore.scala:487)
at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86)
The exception doesnt contains the exact line number to reference my code, so i narrowed down to this code based on the provided key schema columns, and also if i change the groupBy key columns the error changes accordingly.
I tried different things like explicit df0.select() before group by for the required column to ensure that incoming data had the given column. but got the same error. can someone suggest how its picking the Existing key schema, or what should i look for to resolve this?
update [Solved for me]
While uploading the records to eventHub, EventHubSpark library stores the states in checkpoint directory, where it had old state and causing the StateSchemaNotCompatible issue, pointing to new Checkpoint dir solved the issue for me.