
I'm very new to GCP (Google Cloud Platform), so I hope my question doesn't sound too silly.

Background:

The main goal is to gather a few large tables from BigQuery and apply a few transformations. Because of the size of the tables I'm planning to use Dataproc and deploy a PySpark script; ideally I would be able to use sqlContext to apply a few SQL queries to the DFs (the tables pulled from BQ). Finally, I could easily dump this info into a file within a Cloud Storage bucket.

Questions:

  • Can I use import google.datalab.bigquery as bq within my PySpark script?

  • Is the proposed approach the most efficient, or should I consider an alternative? Keep in mind that I need to create many temporary queries, which is why I thought of Spark.

  • I expect to use pandas and bq to read the query results as a pandas DataFrame, following this example. Later, I might use sc.parallelize from Spark to turn the pandas DataFrame into a Spark DataFrame. Is this approach the right one? (A minimal sketch of this handoff follows below.)
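A minimal sketch of that handoff, assuming pandas-gbq is installed on the Dataproc driver; the project id is a placeholder and the public table is only an example. Note that spark.createDataFrame is the usual way to turn a pandas DataFrame into a Spark one, so sc.parallelize is not needed for this step:

    # Hedged sketch: pull a (small) query result into pandas on the driver,
    # then hand it to Spark. Assumes pandas-gbq is installed on the cluster.
    import pandas_gbq
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-pandas-to-spark").getOrCreate()

    # Runs on the driver only; "my-gcp-project" is a placeholder project id.
    pdf = pandas_gbq.read_gbq(
        "SELECT name, SUM(number) AS total "
        "FROM `bigquery-public-data.usa_names.usa_1910_2013` GROUP BY name",
        project_id="my-gcp-project",
        dialect="standard",
    )

    # createDataFrame converts the pandas DataFrame into a Spark DataFrame.
    df = spark.createDataFrame(pdf)
    df.createOrReplaceTempView("names")
    spark.sql("SELECT name, total FROM names ORDER BY total DESC LIMIT 10").show()

The caveat is that everything returned by read_gbq has to fit in the driver's memory, which is why the connector route discussed below avoids pandas altogether for large tables.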

my script

Update:

After a back and forth with @Tanveer, who kindly attended to this question, we concluded that GCP requires an intermediate staging step when you need to read data from BigQuery into Dataproc. Briefly, your Spark or Hadoop job needs a temporary Cloud Storage bucket where the connector stages the table data before bringing it into Spark, as sketched below.
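A minimal sketch of that staging flow with the Hadoop BigQuery connector that ships on Dataproc clusters; the class names and mapred.bq.* keys follow the GCP examples referenced below, and the bucket, project and table ids are placeholders:

    # Hedged sketch of BigQuery -> temporary GCS staging -> Spark on a Dataproc cluster.
    # The connector exports the table to the temp GCS path and Spark reads it from there.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-staging-sketch").getOrCreate()
    sc = spark.sparkContext

    # On Dataproc the GCS connector exposes the cluster's staging bucket and project.
    bucket = sc._jsc.hadoopConfiguration().get("fs.gs.system.bucket")
    project = sc._jsc.hadoopConfiguration().get("fs.gs.project.id")

    conf = {
        "mapred.bq.project.id": project,
        "mapred.bq.gcs.bucket": bucket,
        # Temporary path where the table export is staged before Spark reads it.
        "mapred.bq.temp.gcs.path": "gs://{}/hadoop/tmp/bigquery/pyspark_input".format(bucket),
        # Source table; the public Shakespeare sample is used as a placeholder.
        "mapred.bq.input.project.id": "bigquery-public-data",
        "mapred.bq.input.dataset.id": "samples",
        "mapred.bq.input.table.id": "shakespeare",
    }

    # Each record comes back as (row id, JSON string of the row).
    table_rdd = sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf,
    )

The staged files under mapred.bq.temp.gcs.path are not removed automatically, which is why the GCP examples delete that directory at the end of the job.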

References:

BigQuery Connector | Deployment

thanks so much

Andres Urrego Angel

2 Answers


You will need to use the BigQuery connector for Spark. There are some examples in the GCP documentation here and here. It will create an RDD which you can convert to a DataFrame, and then you will be able to perform all the typical transformations. Hope that helps.
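A self-contained sketch of the RDD-to-DataFrame step the answer describes; a toy RDD of (id, JSON string) pairs stands in for the connector output shown in the question's update, and the output bucket is a placeholder:

    # Hedged sketch: convert connector-style (id, JSON string) records into a DataFrame
    # and run Spark SQL on it. A toy RDD stands in for the sc.newAPIHadoopRDD output.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-to-df-sketch").getOrCreate()
    sc = spark.sparkContext

    # In a real job this would come from the BigQuery connector (see the question's update).
    table_rdd = sc.parallelize([
        (0, '{"word": "brave", "word_count": 5}'),
        (1, '{"word": "new",   "word_count": 8}'),
        (2, '{"word": "world", "word_count": 3}'),
    ])

    # Keep the JSON payload and let Spark infer the schema.
    df = spark.read.json(table_rdd.map(lambda kv: kv[1]))

    # Register an in-memory view so the temporary queries can be plain Spark SQL.
    df.createOrReplaceTempView("bq_table")
    result = spark.sql("SELECT word, SUM(word_count) AS total FROM bq_table GROUP BY word")
    result.show()

    # In the real pipeline the result would be written out, for example:
    # result.write.mode("overwrite").csv("gs://my-output-bucket/bq_results")  # placeholder bucket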

Tanveer Uddin
  • thanks @Tanveer for your answer, let me ask you: if I didn't get it wrong, PySpark will use a Cloud Storage bucket to stage all the data required by my query and then reload that data into the cluster? I was thinking that perhaps BigQuery is exposed like a data catalog to the Dataproc cluster. In that case, if my BigQuery table is very big, I don't know how efficient it will be to download it to a bucket and then load it into my cluster. Can you please confirm if I'm correct? thanks – Andres Urrego Angel Jan 07 '19 at 19:35
  • Hi Andres, no. You don't need to put the data into a bucket yourself. You can read data from the BigQuery table directly in your Spark code. Please have a look at the example link I have given, and feel free to ask if you have any issues connecting to BigQuery from Spark following the example code. – Tanveer Uddin Jan 07 '19 at 20:16
  • Tanveer, I need to keep the query results in memory to create a new temporary view where I expect to execute a few queries. Apparently I can't keep the result of a BQ join query in memory for that purpose. thanks for confirming – Andres Urrego Angel Jan 07 '19 at 20:20
  • yes, it will be in memory. Once you get your DataFrame, use df.createOrReplaceTempView("mySparkInmemoryTable"). Then you can use Spark SQL to query it in several ways. Once you are done, you can output to a GCS bucket or BigQuery etc. – Tanveer Uddin Jan 07 '19 at 21:09
  • I have added my script, the one I have right now, reading this public query. Following the examples you posted, it is required to define a destination bucket and an input bucket; what is the purpose of that if I just need to read the query outcome in memory? btw the script is not running :( lol – Andres Urrego Angel Jan 08 '19 at 02:26
  • Sorry for the confusion. The connector will use the bucket as the staging area. Then once you generate the Spark DataFrame, it follows Spark's execution model. I think you can't avoid it at this moment. https://stackoverflow.com/questions/41462141/read-from-bigquery-into-spark-in-efficient-way Regarding the script not running, what error are you getting? – Tanveer Uddin Jan 08 '19 at 04:46
  • So briefly, yes: to get the data from BigQuery into Dataproc, the process will first stage it in a bucket and then load it into Spark, right? – Andres Urrego Angel Jan 08 '19 at 13:30
  • Hi Andres, that's right. If writing to a GCS bucket is an absolute no-no for you, you can also look at streaming reads using https://github.com/samelamin/spark-bigquery; personally I have not tried it myself. Alternatively, you can go completely serverless using Dataflow (Beam), which is a different learning curve but worth the effort if you are focusing on GCP – Tanveer Uddin Jan 08 '19 at 23:30
  • I found something great: a couple of pandas packages called `pandas_gbq` and `pandasql`. With these two I could do what I intended to achieve with Dataproc. @Tanveer thanks so much for your help and support :) I will review the documentation you shared. thanks – Andres Urrego Angel Jan 09 '19 at 15:36
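For completeness, a minimal sketch of the pandas-only route mentioned in the last comment, assuming pandas-gbq and pandasql are installed; the project id is a placeholder and the public table is only an example. It scales only to results that fit in one machine's memory, which is the trade-off against the Dataproc approach:

    # Hedged sketch: pandas_gbq pulls the query result, pandasql runs SQL over the DataFrame.
    import pandas_gbq
    from pandasql import sqldf

    # "my-gcp-project" is a placeholder project id.
    names = pandas_gbq.read_gbq(
        "SELECT name, number, year "
        "FROM `bigquery-public-data.usa_names.usa_1910_2013` LIMIT 10000",
        project_id="my-gcp-project",
        dialect="standard",
    )

    # pandasql resolves local DataFrames by variable name and runs SQLite-flavoured SQL.
    top = sqldf(
        "SELECT name, SUM(number) AS total FROM names GROUP BY name ORDER BY total DESC LIMIT 10",
        locals(),
    )
    print(top)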

You can directly use the following options to connect to a BigQuery table from Spark.

  1. You can use the spark-bigquery connector https://github.com/samelamin/spark-bigquery to run your queries directly on Dataproc using Spark.

  2. https://github.com/GoogleCloudPlatform/spark-bigquery-connector is a new connector which is in beta. It is a Spark DataSource API to BigQuery and is easy to use (see the usage sketch below).

Please refer to the following link: Dataproc + BigQuery examples - any available?
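A minimal usage sketch of option 2 (the DataSource API); the jar location and table name are assumptions based on the connector's documentation, and exact option names may differ between connector versions:

    # Hedged sketch of the spark-bigquery-connector DataSource API. Submit with the
    # connector jar, e.g. (jar path is an assumption based on the connector docs):
    #   gcloud dataproc jobs submit pyspark job.py \
    #       --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-datasource-sketch").getOrCreate()

    # Read a BigQuery table straight into a DataFrame (public sample table as placeholder).
    df = (spark.read.format("bigquery")
          .option("table", "bigquery-public-data.samples.shakespeare")
          .load())

    df.createOrReplaceTempView("shakespeare")
    spark.sql(
        "SELECT corpus, SUM(word_count) AS words FROM shakespeare GROUP BY corpus ORDER BY words DESC"
    ).show()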

Sarang Shinde