I am trying to read data from Hudi but I get the error below:
Caused by: java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html
I am able to read the data from Hudi in my Jupyter notebook using the commands below:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
  .config("spark.sql.catalogImplementation", "hive")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .enableHiveSupport()
  .getOrCreate()
import org.apache.hudi.DataSourceReadOptions
val hudiIncQueryDF = spark.read.format("hudi").load("path")
import org.apache.spark.sql.functions._
hudiIncQueryDF.filter(col("column_name") === lit("2022-06-01")).show(10, false)
The Jupyter notebook was opened on a cluster that was created with the following property, among others:
--properties spark:spark.jars="gs://rdl-stage-lib/hudi-spark3-bundle_2.12-0.10.0.jar" \
However, when I run the job using spark-submit on the same cluster, I get the error above. I have also added spark.serializer=org.apache.spark.serializer.KryoSerializer to my job properties. I am not sure what the issue is.
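
To narrow this down, a small check like the following could be run at the start of the submitted job to see whether the Hudi bundle is visible to the driver at all. This is only a diagnostic sketch: the object name `HudiClasspathCheck` and the helper `isOnClasspath` are hypothetical, and it assumes that `org.apache.hudi.DefaultSource` is the class the bundle jar uses to provide the `hudi` data source.

```scala
import scala.util.Try

// Hypothetical diagnostic helper, not part of Spark or Hudi.
object HudiClasspathCheck {
  // Returns true if the named class can be located through the current
  // thread's context class loader; the class is not initialized.
  def isOnClasspath(className: String): Boolean =
    Try(Class.forName(className, false, Thread.currentThread().getContextClassLoader)).isSuccess

  def main(args: Array[String]): Unit = {
    // Assumption: this class ships in the hudi-spark3 bundle jar.
    val hudiSource = "org.apache.hudi.DefaultSource"
    println(s"$hudiSource on classpath: ${isOnClasspath(hudiSource)}")
  }
}
```

If this prints `false` under spark-submit but `true` in the notebook, the bundle jar is probably not reaching the submitted job, which would be consistent with the ClassNotFoundException above.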