
I am trying to write a simple program that loads data from a BigQuery table into a Spark DataFrame.

Google's Dataproc PySpark example doesn't work for me, so I also followed these links:

  1. BigQuery connector for pyspark via Hadoop Input Format example

  2. load table from bigquery to spark cluster with pyspark script

and now I get this error from Google:

Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
  "code" : 400,
  "errors" : [
    {
      "domain" : "global",
      "message" : "Required parameter is missing",
      "reason" : "required"
    }
  ],
  "message" : "Required parameter is missing"
}

I cannot figure out which input parameter is missing from my request, and I have not found any clear documentation that describes the input parameters from a PySpark perspective.
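For reference, the input-side configuration in the Dataproc example I followed looks roughly like the dictionary below (the billing project, bucket, and temp path are placeholders of mine, and I may be misreading which of these keys are actually required):

conf_from_example = {
    # Project the job is billed against (not input/output specific), as I understand it.
    "mapred.bq.project.id": "my-project-id",
    # GCS bucket and path the connector uses for temporary export files.
    "mapred.bq.gcs.bucket": "my-bucket",
    "mapred.bq.temp.gcs.path": "gs://my-bucket/hadoop/tmp/bigquery/pyspark_input",
    # Table to read.
    "mapred.bq.input.project.id": "publicdata",
    "mapred.bq.input.dataset.id": "samples",
    "mapred.bq.input.table.id": "shakespeare",
}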

My code is below:

import json
import pyspark

# sc is the SparkContext that the pyspark shell provides by default.
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.get("fs.gs.system.bucket")

conf = {
    "mapred.bq.output.project.id": "test-project-id",
    "mapred.bq.gcs.bucket": "test-bucket",
    "mapred.bq.input.project.id": "publicdata",
    "mapred.bq.input.dataset.id": "samples",
    "mapred.bq.input.table.id": "shakespeare"
}

tableData = sc.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf).map(lambda k: json.loads(k[1])) \
              .map(lambda x: (x["word"], int(x["word_count"]))) \
              .reduceByKey(lambda x, y: x + y)

print(tableData)
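Once the read works, my plan for turning the RDD into an actual DataFrame is roughly the sketch below (it assumes Spark 2.x so a SparkSession is available; on older images I would use sqlContext.createDataFrame instead, and the column names are just my labels for the word-count pairs):

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession and turn the (word, count) pairs into a DataFrame.
spark = SparkSession.builder.getOrCreate()
wordCounts = spark.createDataFrame(tableData, ["word", "word_count"])
wordCounts.show(10)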