
I am trying to connect to BigQuery from Databricks (latest runtime, 7.1+, Spark 3.0) using PySpark as the base language.

We ran the PySpark script below to fetch data from a BigQuery table into Databricks:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName('bq')
    .master('local[4]')
    .config('parentProject', 'google-project-ID')
    .config('spark.jars', 'jarlocation.jar')
    .getOrCreate()
)

df = (
    spark.read.format("bigquery")
    .option("credentialsFile", "file path")
    .option("parentProject", "google-project-ID")
    .option("project", "Dataset-Name")
    .option("table", "dataset.schema.tablename")
    .load()
)

After running the script, when we tried to view the data, we found it had been fetched in a nested format:

{"visitId":"1607519947"},
{"visitStartTime":"1607519947"},
{"date":"20201209"},
{"totals":{"visits": 1, "hits": 1, "pageviews": 1, "timeOnSite": null, "bounces": 1, "transactions": null, "transactionRevenue": null, "newVisits": 1, "screenviews": null, "uniqueScreenviews": null, "timeOnScreen": null, "totalTransactionRevenue": null, "sessionQualityDim": 0}},
{"hits": [{"hitNumber": 1, "time": 0, "hour": 14, "minute": 19, "isExit": true, "referer": null, 
"page": {"pagePath": "/nieuws/Post-hoc-analyse-naar-KRd-bij-18-maanden-na-randomisatie", "hostname": "www.amgenhematologie.nl", "pagePathLevel4": ""}, 
"transaction": {"transactionId": null, "transactionRevenue": null, "transactionTax": null, "transactionShipping": null, "affiliation": null},
"item": {"transactionId": null, "productName": null, "productCategory": null, "productSku": null, "itemQuantity": null, "itemRevenue": null, "currencyCode": "(not set)", "localItemRevenue": null}, 
"eventInfo": null, 
"product": [], 
"promotion": [], 
"promotionActionInfo": null, "refund": null, 
"eCommerceAction": {"action_type": "0", "step": 1, "option": null}, 
"experiment": [], 
"publisher": null, 
"customVariables": [], 
"customDimensions": [], 
"customMetrics": [], 
"type": "PAGE", 
"social": {"socialInteractionNetwork": null, "socialInteractionAction": null, "socialInteractions": null, "socialInteractionTarget": null, "socialNetwork": "(not set)", "uniqueSocialInteractions": null, "hasSocialSourceReferral": "No", "socialInteractionNetworkAction": " : "}, 
"dataSource": "web", 
"publisher_infos": []}]}

Sample Nested Dataframe

Above is a sample of the data in its nested format.

In this sample, the first 3 columns (visitId, visitStartTime and date) are direct columns.

The 4th column, totals, is nested and needs unnesting so that each key becomes a separate column header with its value, e.g. totals.visits, totals.hits, etc., just like the first 3 columns.

The same goes for the 5th column, hits, which contains multiple nested dictionaries; each key inside those dictionaries should likewise be unnested into a separate column, as described for the 4th column above.

Is there any way to unnest the data in PySpark while reading it from BigQuery directly?

Any help will be appreciated. Thanks in advance!

Naveen B
Harini
  • what is nested format? – mck Dec 15 '20 at 13:51
  • Nested format is a nested dictionary. One column will have dictionary values; the dict contains multiple key-value pairs, which should be expanded into separate columns. – Harini Dec 16 '20 at 06:38
  • Could you provide an example of what you want to unnest, and how you want to unnest it? – mck Dec 16 '20 at 06:39
  • I edited the question, added the sample data and my requirement. – Harini Dec 16 '20 at 07:27
  • See the answers in https://stackoverflow.com/questions/34271398/flatten-nested-spark-dataframe – mck Dec 16 '20 at 07:41
  • This actually helps to unnest the columns at the dataframe level once we fetch the data from bigquery. But we are trying to fetch the columns in an unnested way directly from bigquery. – Harini Dec 16 '20 at 08:03
  • Added a sample spark dataframe image to the question; I need every dictionary key-value pair as an individual column while fetching from bigquery directly. – Harini Dec 16 '20 at 08:05

0 Answers