PySpark parse Json using RDD and json.load

Question

{

  "city": "Tempe",
  "state": "AZ",
  ...
  "attributes": [
    "BikeParking: True",
    "BusinessAcceptsBitcoin: False",
    "BusinessAcceptsCreditCards: True",
    "BusinessParking: {'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}",
    "DogsAllowed: False",
    "RestaurantsPriceRange2: 2",
    "WheelchairAccessible: True"
  ],
  ...
}

Hello, I am using PySpark and trying to output a tuple of (state, BusinessAcceptsBitcoin). Currently I am doing:

csr = (dataset
       .filter(lambda e: "city" in e and "BusinessAcceptsBitcoin" in e)
       .map(lambda e: (e["city"], e["BusinessAcceptsBitcoin"]))
       .collect()
       )

But this command fails. How can I get the "BusinessAcceptsBitcoin" and "city" fields?

[How to make good reproducible Apache Spark Dataframe examples](https://stackoverflow.com/q/48427185/8371915) — Alper t. Turker, Feb 08 '18 at 12:05
Best guess is that it is a duplicate of [Read multiline JSON in Apache Spark](https://stackoverflow.com/q/38545850/8371915) — Alper t. Turker, Feb 08 '18 at 12:06

cts_superking · Answer 1 · 2018-02-08T06:30:15.740

You can use a DataFrame and a UDF to parse the 'attributes' field.

From the sample data you have given, 'attributes' does not appear to be a proper JSON object or dict; it is a list of "name: value" strings.

Assuming each entry in 'attributes' is just a string, here is sample code using a DataFrame and a UDF.

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, udf
from pyspark.sql.types import StringType

spark = SparkSession \
            .builder \
            .appName("test") \
            .getOrCreate()

#sample data
data=[{

  "city": "Tempe",
  "state": "AZ",
  "attributes": [
    "BikeParking: True",
    "BusinessAcceptsBitcoin: False",
    "BusinessAcceptsCreditCards: True",
    "BusinessParking: {'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}",
    "DogsAllowed: False",
    "RestaurantsPriceRange2: 2",
    "WheelchairAccessible: True"
  ]
}]
df=spark.sparkContext.parallelize(data).toDF()

User-defined function to pick the matching string out of the list

def get_attribute(data, attribute):
    # Return the first entry that contains the attribute name, or None if there is none
    return next((list_item for list_item in data if attribute in list_item), None)

Register the UDF

udf_get_attribute = udf(get_attribute, StringType())

Query the DataFrame

df.withColumn("BusinessAcceptsBitcoin",udf_get_attribute("attributes",lit("BusinessAcceptsBitcoin"))).select("city","BusinessAcceptsBitcoin").show(truncate=False)

Sample output

+-----+-----------------------------+
|city |BusinessAcceptsBitcoin       |
+-----+-----------------------------+
|Tempe|BusinessAcceptsBitcoin: False|
+-----+-----------------------------+

You can use the same UDF to query any other field too, for example:

df.withColumn("DogsAllowed",udf_get_attribute("attributes",lit("DogsAllowed"))).select("city","DogsAllowed").show(truncate=False)
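Note that the UDF above returns the whole entry ("BusinessAcceptsBitcoin: False"), not just the value. If only the value is wanted, something along these lines might work. This is a minimal sketch in plain Python; `get_attribute_value` is a hypothetical helper name, and it assumes every entry has the "name: value" shape shown in the sample data.

```python
def get_attribute_value(attributes, name):
    """Return the text after 'name: ' for the first matching entry, else None."""
    prefix = name + ": "
    for entry in attributes or []:
        # Match on the exact "name: " prefix rather than a substring test
        if entry.startswith(prefix):
            return entry[len(prefix):]
    return None


# Entries copied from the sample data in the question
sample = [
    "BikeParking: True",
    "BusinessAcceptsBitcoin: False",
    "BusinessAcceptsCreditCards: True",
]

bitcoin = get_attribute_value(sample, "BusinessAcceptsBitcoin")
missing = get_attribute_value(sample, "DogsAllowed")
print(bitcoin, missing)
```

It should be possible to register it the same way as before, i.e. `udf(get_attribute_value, StringType())`, and pass the attribute name with `lit(...)`.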

Sorry, but I can't use a DataFrame for this. It has to be RDD only! — Michael Chang, Feb 08 '18 at 18:26
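For an RDD-only version, the likely reason the original filter fails is that "BusinessAcceptsBitcoin" is not a top-level key of the record; it is buried inside the strings of the 'attributes' list. A rough sketch of one way around that is below. The helper names (`parse_attributes`, `city_and_bitcoin`, `has_both`) are made up for illustration, the second record is a hypothetical one added to show the filter, and a plain Python list stands in for the RDD so the snippet runs without Spark. It assumes each RDD element is already a dict, as the question's code implies.

```python
import ast


def parse_attributes(attributes):
    """Turn ["Name: value", ...] into {"Name": value, ...}."""
    parsed = {}
    for entry in attributes or []:
        name, sep, raw = entry.partition(": ")
        if not sep:
            continue  # skip entries that do not look like "name: value"
        try:
            # "False" -> False, "2" -> 2, "{'lot': True}" -> dict
            parsed[name] = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            parsed[name] = raw  # keep the raw text if it is not a Python literal
    return parsed


def city_and_bitcoin(record):
    """Build the (city, BusinessAcceptsBitcoin) pair; missing parts become None."""
    attrs = parse_attributes(record.get("attributes"))
    return (record.get("city"), attrs.get("BusinessAcceptsBitcoin"))


def has_both(pair):
    return pair[0] is not None and pair[1] is not None


records = [
    {
        "city": "Tempe",
        "state": "AZ",
        "attributes": [
            "BikeParking: True",
            "BusinessAcceptsBitcoin: False",
            "BusinessParking: {'garage': False, 'street': False, "
            "'validated': False, 'lot': True, 'valet': False}",
            "RestaurantsPriceRange2: 2",
        ],
    },
    # Hypothetical record with no attributes, to show it gets filtered out
    {"city": "Tempe", "state": "AZ", "attributes": []},
]

result = [pair for pair in map(city_and_bitcoin, records) if has_both(pair)]
print(result)

# With the RDD from the question (assumed to be named `dataset`), the same
# functions would be applied as:
#   csr = dataset.map(city_and_bitcoin).filter(has_both).collect()
# If the RDD holds raw JSON lines rather than dicts, map json.loads over it first.
```

Mapping first and filtering afterwards means each record's attributes are parsed only once.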
