
I'm trying to connect GCP (Google BigQuery) with Spark (using PySpark) without using Dataproc, i.e. with self-hosted Spark in-house. The official Google documentation only covers Dataproc: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example. Any suggestions? Note: my Spark & Hadoop setup runs on Docker. Thanks

KKK

2 Answers


Please have a look at the project page on GitHub - it details how to reference the GCP credentials from the code.

In short, you should run

spark.read.format("bigquery").option("credentialsFile", "</path/to/key/file>").option("table", "<table>").load()

Please refer here for how to create the JSON credentials file, if needed.
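
Putting the pieces together, here is a minimal PySpark sketch assuming the connector jar is already on the classpath; the key-file path and the project/dataset/table names are placeholders to replace with your own.

from pyspark.sql import SparkSession

# Start (or reuse) a session; the spark-bigquery-connector jar must already be
# on the classpath (see the other answer for ways to add it).
spark = SparkSession.builder \
    .appName("bigquery-read-example") \
    .getOrCreate()

# Read a BigQuery table by authenticating with a service-account key file.
# Both the key path and the table name below are placeholder values.
df = spark.read.format("bigquery") \
    .option("credentialsFile", "/path/to/key/file.json") \
    .option("table", "my-project.my_dataset.my_table") \
    .load()

df.printSchema()
df.show(10)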

David Rabinowitz

The BigQuery connector is publicly available as a jar file, spark-bigquery-connector. You can then either:

  • Add it to the classpath of your on-premises/self-hosted cluster, so your applications can reach the BigQuery API.
  • Add the connector only to your Spark applications, for example with the --jars option (see the sketch after this list). There are other possibilities here that can impact your app; to learn more, check Add jars to a Spark Job - spark-submit.
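
As a sketch of the second bullet, the connector can be passed at submit time or resolved from Maven Central when the session is built in code; the connector version below is just an example to adjust to your Spark/Scala versions.

# At submit time the connector can be shipped with the job, e.g.:
#   spark-submit --jars /path/to/spark-bigquery-with-dependencies_2.12-0.36.1.jar my_job.py
# or pulled from Maven Central:
#   spark-submit --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1 my_job.py
#
# When building the session in code, the equivalent is spark.jars.packages,
# which must be set before the session is created:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("bigquery-connector-example") \
    .config("spark.jars.packages",
            "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1") \
    .getOrCreate()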

Once the jar is added to the classpath, you can check the two BigQuery connector examples; one of them was already provided by @David Rabinowitz.

rsantiago