I am desperately trying to write a simple program that loads data from BigQuery into a Spark DataFrame.
Google's Dataproc PySpark example doesn't work, so I also followed these links:
BigQuery connector for pyspark via Hadoop Input Format example
load table from bigquery to spark cluster with pyspark script
and now I see this error from Google:
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "Required parameter is missing",
    "reason" : "required"
  } ],
  "message" : "Required parameter is missing"
}
I am not able to figure out which input parameters are missing from my request, and there is no clear documentation that describes the input parameters from a PySpark perspective.
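From the linked examples it looks like the same keys can also be set directly on the Hadoop configuration instead of being passed as a dict to newAPIHadoopRDD; a sketch of that variant is below (it uses exactly the same key names as my dict, so I assume it would fail the same way):
# Alternative: set the connector keys on the Hadoop configuration itself.
# Same key names as in the conf dict further down; sc is the pyspark shell's SparkContext.
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("mapred.bq.output.project.id", "test-project-id")
hadoopConf.set("mapred.bq.gcs.bucket", "test-bucket")
hadoopConf.set("mapred.bq.input.project.id", "publicdata")
hadoopConf.set("mapred.bq.input.dataset.id", "samples")
hadoopConf.set("mapred.bq.input.table.id", "shakespeare")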
My full code is below:
import json
import pyspark

# sc is the SparkContext provided by the pyspark shell on the Dataproc cluster.
hadoopConf = sc._jsc.hadoopConfiguration()
# The GCS connector's default staging bucket for the cluster.
print(hadoopConf.get("fs.gs.system.bucket"))

conf = {
    "mapred.bq.output.project.id": "test-project-id",
    "mapred.bq.gcs.bucket": "test-bucket",
    "mapred.bq.input.project.id": "publicdata",
    "mapred.bq.input.dataset.id": "samples",
    "mapred.bq.input.table.id": "shakespeare",
}

# Read the public shakespeare sample through the BigQuery Hadoop connector,
# parse each JSON record, and sum the word counts per word.
tableData = (
    sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable", "com.google.gson.JsonObject",
        conf=conf,
    )
    .map(lambda kv: json.loads(kv[1]))
    .map(lambda rec: (rec["word"], int(rec["word_count"])))
    .reduceByKey(lambda x, y: x + y)
)
print(tableData)
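I realise print(tableData) only shows the RDD object itself; once the read works, this is roughly how I expect to inspect a few rows (just to show what I am ultimately after):
# Pull a handful of aggregated (word, count) pairs back to the driver.
for word, count in tableData.take(5):
    print(word, count)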