I need some help with my first attempt at parsing JSON that arrives on Kafka into Spark Structured Streaming. I am struggling to convert the incoming JSON into a flat dataframe for further processing.
My input JSON is:
[
{ "siteId": "30:47:47:BE:16:8F", "siteData":
[
{ "dataseries": "trend-255", "values":
[
{"ts": 1502715600, "value": 35.74 },
{"ts": 1502715660, "value": 35.65 },
{"ts": 1502715720, "value": 35.58 },
{"ts": 1502715780, "value": 35.55 }
]
},
{ "dataseries": "trend-256", "values":
[
{"ts": 1502715840, "value": 18.45 },
{"ts": 1502715900, "value": 18.35 },
{"ts": 1502715960, "value": 18.32 }
]
}
]
},
{ "siteId": "30:47:47:BE:16:FF", "siteData":
[
{ "dataseries": "trend-255", "values":
[
{"ts": 1502715600, "value": 35.74 },
{"ts": 1502715660, "value": 35.65 },
{"ts": 1502715720, "value": 35.58 },
{"ts": 1502715780, "value": 35.55 }
]
},
{ "dataseries": "trend-256", "values":
[
{"ts": 1502715840, "value": 18.45 },
{"ts": 1502715900, "value": 18.35 },
{"ts": 1502715960, "value": 18.32 }
]
}
]
}
]
My Spark schema is:
from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                               StructField, StructType)

data1_spark_schema = ArrayType(
    StructType([
        StructField("siteId", StringType(), False),
        StructField("siteData", ArrayType(StructType([
            StructField("dataseries", StringType(), False),
            StructField("values", ArrayType(StructType([
                StructField("ts", IntegerType(), False),
                StructField("value", StringType(), False)
            ]), False), False)
        ]), False), False)
    ]),
    False
)
My code is very simple:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from config.general import kafka_instance
from config.general import topic
from schemas.schema import data1_spark_schema
spark = SparkSession \
    .builder \
    .appName("Structured_BMS_Feed") \
    .getOrCreate()
stream = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_instance) \
    .option("subscribe", topic) \
    .option("startingOffsets", "latest") \
    .option("max.poll.records", 100) \
    .option("failOnDataLoss", False) \
    .load()
stream_records = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING) as bms_data1") \
    .select(from_json("bms_data1", data1_spark_schema).alias("bms_data1"))

sites = stream_records.select(explode("bms_data1").alias("site")) \
    .select("site.*")
sites.printSchema()
stream_debug = sites.writeStream \
    .outputMode("append") \
    .format("console") \
    .option("numRows", 20) \
    .option("truncate", False) \
    .start()
stream_debug.awaitTermination()
When I run this code, the schema prints like this:
root
|-- siteId: string (nullable = false)
|-- siteData: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- dataseries: string (nullable = false)
| | |-- values: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- ts: integer (nullable = false)
| | | | |-- value: string (nullable = false)
Is it possible to flatten this schema so that I get all the fields in a flat dataframe instead of nested JSON? In other words, for every ts/value pair I would like one row that also carries its parent dataseries and siteId.
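To make the target shape concrete, here is the output I am after, sketched in plain Python over an abbreviated copy of the payload (illustrative only; in the real job the data arrives from Kafka, and I assume the Spark equivalent would involve exploding siteData and values as well):

```python
import json

# Abbreviated copy of the sample payload above (one site, two series).
payload = """
[
  { "siteId": "30:47:47:BE:16:8F", "siteData": [
      { "dataseries": "trend-255", "values": [
          {"ts": 1502715600, "value": 35.74},
          {"ts": 1502715660, "value": 35.65} ] },
      { "dataseries": "trend-256", "values": [
          {"ts": 1502715840, "value": 18.45} ] }
  ] }
]
"""

# One output row per leaf: (siteId, dataseries, ts, value).
rows = [
    (site["siteId"], series["dataseries"], point["ts"], point["value"])
    for site in json.loads(payload)
    for series in site["siteData"]
    for point in series["values"]
]

for row in rows:
    print(row)
# ('30:47:47:BE:16:8F', 'trend-255', 1502715600, 35.74)
# ('30:47:47:BE:16:8F', 'trend-255', 1502715660, 35.65)
# ('30:47:47:BE:16:8F', 'trend-256', 1502715840, 18.45)
```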