I'm trying to connect GCP (Google BigQuery) with Spark (using PySpark) without using Dataproc (self-hosted Spark in house). The official Google documentation only covers Dataproc: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example. Any suggestions? Note: my Spark & Hadoop setup runs on Docker. Thanks
2 Answers
0
Please have a look at the project page on GitHub - it details how to reference the GCP credentials from the code.
In short, you should run:
spark.read.format("bigquery").option("credentialsFile", "</path/to/key/file>").option("table", "<table>").load()
Please refer here for how to create the JSON credentials file, if needed.
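As a rough end-to-end sketch for a self-hosted (non-Dataproc) cluster like the one in the question (the connector version, key-file path and table name below are placeholders, not values from this answer):

# Minimal PySpark sketch for a self-hosted (non-Dataproc) cluster.
# Connector version, key-file path and table name are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("bigquery-read-example")
    # Fetch the connector at startup; alternatively ship the jar with --jars.
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1",
    )
    .getOrCreate()
)

df = (
    spark.read.format("bigquery")
    # Service-account key file mounted into the Docker container (placeholder path).
    .option("credentialsFile", "/path/to/key.json")
    # Fully qualified table name: project.dataset.table (public sample table here).
    .option("table", "bigquery-public-data.samples.shakespeare")
    .load()
)

df.printSchema()
df.show(5)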

David Rabinowitz
- Is the spark. here a SparkSession, SparkContext or SQLContext? – KKK Nov 01 '19 at 09:27
- SparkSession. The convention for SparkContext is sc. – David Rabinowitz Nov 01 '19 at 22:20
- Can I have a full code sample for how to begin and call for data, please? Thanks – KKK Nov 19 '19 at 12:04
- https://github.com/GoogleCloudPlatform/spark-bigquery-connector/blob/master/examples/python/shakespeare.py – David Rabinowitz Nov 19 '19 at 16:22
0
The BigQuery connector is publicly available as a jar file, spark-bigquery-connector. You can then either:
- Add it to the classpath on your on-premise/self-hosted cluster, so your applications can reach the BigQuery API.
- Add the connector only to your Spark applications, for example with the --jars option (see the sketch below). Regarding this, there are some other possibilities that can impact your application; to learn more, please check Add jars to a Spark Job - spark-submit.
Once the jar is on the classpath you can check the two BigQuery connector examples; one of them was already provided by @David Rabinowitz.
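As a rough illustration of the --jars route (the jar path, version and script name below are placeholders, not values from this answer):

# Sketch of the --jars route; the jar path/version and script name are placeholders.
#
#   spark-submit \
#     --jars /opt/jars/spark-bigquery-with-dependencies_2.12-0.36.1.jar \
#     read_bigquery.py
#
# Inside read_bigquery.py the connector is then already on the classpath:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigquery-jars-example").getOrCreate()

df = (
    spark.read.format("bigquery")
    .option("credentialsFile", "/path/to/key.json")   # placeholder key-file path
    .option("table", "<project>.<dataset>.<table>")   # placeholder table name
    .load()
)
df.show(5)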

rsantiago