
I'm trying to connect GCP (Google BigQuery) with Spark (using PySpark) without using Dataproc, i.e. with self-hosted Spark in-house. The official Google documentation only covers Dataproc: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example. Any suggestions? Note: my Spark & Hadoop setup runs on Docker. Thanks

KKK

2 Answers


Please have a look at the project page on GitHub - it details how to reference the GCP credentials from the code.

In short, you should run

spark.read.format("bigquery").option("credentialsFile", "</path/to/key/file>").option("table", "<table>").load()

Please refer here for how to create the JSON credentials file, if needed.
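
Putting the pieces together, here is a minimal PySpark sketch assuming the connector jar is already on the classpath; the key-file path and the project/dataset/table names are placeholders to replace with your own.

from pyspark.sql import SparkSession

# Start (or reuse) a session; the spark-bigquery-connector jar must already be
# on the classpath (see the other answer for ways to add it).
spark = SparkSession.builder \
    .appName("bigquery-read-example") \
    .getOrCreate()

# Read a BigQuery table by authenticating with a service-account key file.
# Both the key path and the table name below are placeholder values.
df = spark.read.format("bigquery") \
    .option("credentialsFile", "/path/to/key/file.json") \
    .option("table", "my-project.my_dataset.my_table") \
    .load()

df.printSchema()
df.show(10)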

David Rabinowitz

The BigQuery connector is publicly available as a jar file, spark-bigquery-connector. You can then either:

  • Add it to the classpath of your on-premises/self-hosted cluster, so your applications can reach the BigQuery API.
  • Add the connector only to your Spark applications, for example with the --jars option (see the sketch after this list). There are other possibilities here that can impact your app; to learn more, check Add jars to a Spark Job - spark-submit.
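
As a sketch of the second bullet, the connector can be passed at submit time or resolved from Maven Central when the session is built in code; the connector version below is just an example to adjust to your Spark/Scala versions.

# At submit time the connector can be shipped with the job, e.g.:
#   spark-submit --jars /path/to/spark-bigquery-with-dependencies_2.12-0.36.1.jar my_job.py
# or pulled from Maven Central:
#   spark-submit --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1 my_job.py
#
# When building the session in code, the equivalent is spark.jars.packages,
# which must be set before the session is created:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("bigquery-connector-example") \
    .config("spark.jars.packages",
            "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1") \
    .getOrCreate()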

Once the jar is added to the classpath, you can check the two BigQuery connector examples; one of them was already provided by @David Rabinowitz.

rsantiago