I am coding in a Jupyter notebook on GCP Dataproc. Below is the code I am running:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("yarn") \
.appName('1.2. BigQuery Storage & Spark SQL - Python') \
.config('spark.jars','gs://dev-pysparkfiles/Spark-Bigquery-Connector-2.12-0.24.2.jar') \
.config('spark.jars.packages','com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.24.2') \
.getOrCreate()
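(For context, sql is a plain BigQuery SQL string defined earlier in the notebook; the dataset and table names below are hypothetical placeholders, not the real ones:)
# hypothetical example; the actual query is a similar standard-SQL SELECT
sql = "SELECT * FROM my_dataset.my_table LIMIT 10"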
df = spark.read.format("com.google.cloud.spark.bigquery") \
.option("materializationDataset", "ABC_HK_STG_TEMP_SIT") \
.option("materializationExpirationTimeInMinutes", "1440") \
.option("query", sql) \
.load()
I tried changing com.google.cloud.spark.bigquery to bigquery.
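The changed read call looked roughly like this (identical to the code above except for the format string; sql and the options are unchanged):
df = spark.read.format("bigquery") \
.option("materializationDataset", "ABC_HK_STG_TEMP_SIT") \
.option("materializationExpirationTimeInMinutes", "1440") \
.option("query", sql) \
.load()
But it is still giving me the error: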
java.lang.ClassNotFoundException: Failed to find data source: bigquery
I also removed the spark.jars.packages config when creating the Spark session, as suggested in another Stack Overflow answer, but I am still getting the same error.
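For reference, that attempt built the session roughly like this (the same as my first snippet, just without the spark.jars.packages line):
spark = SparkSession.builder.master("yarn") \
.appName('1.2. BigQuery Storage & Spark SQL - Python') \
.config('spark.jars','gs://dev-pysparkfiles/Spark-Bigquery-Connector-2.12-0.24.2.jar') \
.getOrCreate()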