java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html

Question

I am trying to read data from Hudi, but I get the error below:

Caused by: java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html

I can read the data from Hudi in my Jupyter notebook with the commands below:

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.config(
    "spark.sql.catalogImplementation", "hive"
).config(
    "spark.serializer", "org.apache.spark.serializer.KryoSerializer"
).enableHiveSupport().getOrCreate




import org.apache.hudi.DataSourceReadOptions
val hudiIncQueryDF = spark.read.format("hudi").load(
    "path"
)

import org.apache.spark.sql.functions._
hudiIncQueryDF.filter(col("column_name")===lit("2022-06-01")).show(10,false)

This Jupyter notebook runs on a cluster that was created with the following property, among others:

--properties spark:spark.jars="gs://rdl-stage-lib/hudi-spark3-bundle_2.12-0.10.0.jar" \

However, when I run the job using spark-submit on the same cluster, I get the error above. I have also added spark.serializer=org.apache.spark.serializer.KryoSerializer to my job properties. I am not sure what the issue is.

score 0 · Answer 1 · answered Sep 03 '22 at 19:27

Your application depends on the Hudi jar, and Hudi itself has dependencies. When you add the Maven package to your session, Spark downloads the Hudi jar together with its dependencies. In your case, you provide only the Hudi jar file from a GCS bucket.

You can try this property instead (pick the bundle that matches your cluster's Spark and Scala versions):

--properties spark:spark.jars.packages="org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0" \

Or directly from your notebook:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.config(
    "spark.sql.catalogImplementation", "hive"
).config(
    "spark.serializer", "org.apache.spark.serializer.KryoSerializer"
).config(
    "spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog"
).config(
    "spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
).config(
    "spark.jars.packages", "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0"
).enableHiveSupport().getOrCreate
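
To tell whether the bundle actually reached the job, a small check like the one below may help. It is only a sketch: it assumes the Hudi data source entry point is the class org.apache.hudi.DefaultSource, and the object name HudiClasspathCheck is made up for illustration. It tries to load the class through the current thread's class loader and reports the result, which can be compared between the notebook and the spark-submit job.

```scala
import scala.util.Try

object HudiClasspathCheck {
  // Assumption: this is the entry-point class of the Hudi Spark data source.
  // Adjust the name if your bundle exposes a different one.
  val HudiSourceClass = "org.apache.hudi.DefaultSource"

  /** Returns true if the named class can be loaded without initializing it. */
  def isLoadable(className: String): Boolean = {
    // Prefer the thread context class loader; fall back to the loader
    // that loaded this object if no context loader is set.
    val loader = Option(Thread.currentThread().getContextClassLoader)
      .getOrElse(getClass.getClassLoader)
    Try(Class.forName(className, false, loader)).isSuccess
  }

  def main(args: Array[String]): Unit = {
    if (isLoadable(HudiSourceClass))
      println(s"$HudiSourceClass found on the classpath")
    else
      println(s"$HudiSourceClass not found: the Hudi bundle was not added to this job")
  }
}
```

If the class is found in the notebook but not in the submitted job, the jar or package setting is probably not being passed to spark-submit.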
