You can use a DataFrame and a UDF to parse the 'attributes' field.
From the sample data you have given, 'attributes' doesn't seem to be proper JSON or a dict.
Assuming each 'attributes' entry is just a string, here is sample code using a DataFrame and a UDF.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
spark = SparkSession \
    .builder \
    .appName("test") \
    .getOrCreate()
#sample data
data=[{
"city": "Tempe",
"state": "AZ",
"attributes": [
"BikeParking: True",
"BusinessAcceptsBitcoin: False",
"BusinessAcceptsCreditCards: True",
"BusinessParking: {'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}",
"DogsAllowed: False",
"RestaurantsPriceRange2: 2",
"WheelchairAccessible: True"
]
}]
df=spark.sparkContext.parallelize(data).toDF()
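User-defined function to parse the string:
def get_attribute(data, attribute):
    # Return the first list entry that contains the attribute name, or None if nothing matches
    return next((list_item for list_item in data if attribute in list_item), None)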
Register the UDF:
udf_get_attribute = udf(get_attribute, StringType())
Apply it to the DataFrame:
df.withColumn("BusinessAcceptsBitcoin",udf_get_attribute("attributes",lit("BusinessAcceptsBitcoin"))).select("city","BusinessAcceptsBitcoin").show(truncate=False)
Sample output:
+-----+-----------------------------+
|city |BusinessAcceptsBitcoin       |
+-----+-----------------------------+
+-----+-----------------------------+
|Tempe|BusinessAcceptsBitcoin: False|
+-----+-----------------------------+
You can use the same UDF to query any other field too, for example:
df.withColumn("DogsAllowed",udf_get_attribute("attributes",lit("DogsAllowed"))).select("city","DogsAllowed").show(truncate=False)
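The UDF above returns the whole "name: value" entry and matches by substring, so a lookup for "Parking" would hit "BikeParking" first. If you only want the value, here is a minimal plain-Python sketch of a stricter parser. The names get_attribute_value and parse_literal are hypothetical helpers, not part of the code above, and the sketch assumes every entry has the form "name: value", as in your sample.

```python
import ast

def get_attribute_value(data, attribute):
    """Return the value text for an exact attribute name, or None if absent."""
    for item in data or []:
        # Split only on the first ": " so nested values keep their own colons.
        key, sep, value = item.partition(": ")
        if sep and key == attribute:
            return value
    return None

def parse_literal(value):
    """Best-effort conversion of the value text into a Python object."""
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value  # leave anything that is not a Python literal as text

# Same shape as the 'attributes' list in the sample data
attributes = [
    "BikeParking: True",
    "BusinessParking: {'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}",
    "RestaurantsPriceRange2: 2",
]

print(get_attribute_value(attributes, "BikeParking"))                            # True
print(parse_literal(get_attribute_value(attributes, "BusinessParking"))["lot"])  # True
print(get_attribute_value(attributes, "Parking"))                                # None (exact match only)
```

Since get_attribute_value returns a string or None, it should be possible to register it the same way as above, with udf(get_attribute_value, StringType()). The nested 'BusinessParking' value uses Python literal syntax rather than JSON, which is why the sketch uses ast.literal_eval instead of a JSON parser.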