I need some help with my first attempt at parsing JSON that arrives on Kafka into Spark Structured Streaming. I am struggling to convert the incoming JSON into a flat dataframe for further processing.
My input JSON is:
[
{ "siteId": "30:47:47:BE:16:8F", "siteData":
[
{ "dataseries": "trend-255", "values":
[
{"ts": 1502715600, "value": 35.74 },
{"ts": 1502715660, "value": 35.65 },
{"ts": 1502715720, "value": 35.58 },
{"ts": 1502715780, "value": 35.55 }
]
},
{ "dataseries": "trend-256", "values":
[
{"ts": 1502715840, "value": 18.45 },
{"ts": 1502715900, "value": 18.35 },
{"ts": 1502715960, "value": 18.32 }
]
}
]
},
{ "siteId": "30:47:47:BE:16:FF", "siteData":
[
{ "dataseries": "trend-255", "values":
[
{"ts": 1502715600, "value": 35.74 },
{"ts": 1502715660, "value": 35.65 },
{"ts": 1502715720, "value": 35.58 },
{"ts": 1502715780, "value": 35.55 }
]
},
{ "dataseries": "trend-256", "values":
[
{"ts": 1502715840, "value": 18.45 },
{"ts": 1502715900, "value": 18.35 },
{"ts": 1502715960, "value": 18.32 }
]
}
]
}
]
My Spark schema is:
from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                               StructField, StructType)

data1_spark_schema = ArrayType(
    StructType([
        StructField("siteId", StringType(), False),
        StructField("siteData", ArrayType(StructType([
            StructField("dataseries", StringType(), False),
            StructField("values", ArrayType(StructType([
                StructField("ts", IntegerType(), False),
                StructField("value", StringType(), False)
            ]), False), False)
        ]), False), False)
    ]),
    False
)
My code is very simple:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from config.general import kafka_instance
from config.general import topic
from schemas.schema import data1_spark_schema
spark = SparkSession \
    .builder \
    .appName("Structured_BMS_Feed") \
    .getOrCreate()
stream = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_instance) \
    .option("subscribe", topic) \
    .option("startingOffsets", "latest") \
    .option("max.poll.records", 100) \
    .option("failOnDataLoss", False) \
    .load()
stream_records = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING) as bms_data1") \
    .select(from_json("bms_data1", data1_spark_schema).alias("bms_data1"))

sites = stream_records.select(explode("bms_data1").alias("site")) \
    .select("site.*")
sites.printSchema()
stream_debug = sites.writeStream \
    .outputMode("append") \
    .format("console") \
    .option("numRows", 20) \
    .option("truncate", False) \
    .start()
stream_debug.awaitTermination()
When I run this code, the schema prints like this:
root
|-- siteId: string (nullable = false)
|-- siteData: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- dataseries: string (nullable = false)
| | |-- values: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- ts: integer (nullable = false)
| | | | |-- value: string (nullable = false)
Is it possible to flatten this schema so that I get all the fields in a flat dataframe instead of nested JSON? In other words, for every ts/value pair I would like one row that also carries its parent dataseries and siteId.
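To make the target shape concrete, here is the output I am after, sketched in plain Python over an abbreviated copy of the payload (illustrative only; in the real job the data arrives from Kafka, and I assume the Spark equivalent would involve exploding siteData and values as well):

```python
import json

# Abbreviated copy of the sample payload above (one site, two series).
payload = """
[
  { "siteId": "30:47:47:BE:16:8F", "siteData": [
      { "dataseries": "trend-255", "values": [
          {"ts": 1502715600, "value": 35.74},
          {"ts": 1502715660, "value": 35.65} ] },
      { "dataseries": "trend-256", "values": [
          {"ts": 1502715840, "value": 18.45} ] }
  ] }
]
"""

# One output row per leaf: (siteId, dataseries, ts, value).
rows = [
    (site["siteId"], series["dataseries"], point["ts"], point["value"])
    for site in json.loads(payload)
    for series in site["siteData"]
    for point in series["values"]
]

for row in rows:
    print(row)
# ('30:47:47:BE:16:8F', 'trend-255', 1502715600, 35.74)
# ('30:47:47:BE:16:8F', 'trend-255', 1502715660, 35.65)
# ('30:47:47:BE:16:8F', 'trend-256', 1502715840, 18.45)
```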