I want to inspect the data in a Spark streaming DataFrame and then apply business logic to it.
So far I have tried converting the streaming DataFrame to an RDD. Once I have the RDD, I want to apply a function that transforms the data and also creates a new column with a schema (for a specific message).
dsraw = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", bootstrap_kafka_server) \
    .option("subscribe", topic) \
    .load() \
    .selectExpr("CAST(value AS STRING)")
print("type (df_stream): {}".format(type(dsraw)))
print("schema (dsraw):")
dsraw.printSchema()  # prints the schema itself and returns None, so it is not wrapped in print
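from ast import literal_eval

def show_data_fun(batch_df, epoch_id):
    # The (DataFrame, epoch_id) signature is the one foreachBatch expects
    batch_df.show()
    # Use the row passed to the lambda, not the DataFrame column
    row_rdd = batch_df.rdd.map(lambda row: literal_eval(row['value']))
    json_data = row_rdd.collect()  # pulls the whole batch to the driver; for debugging only
    print("From rdd : {}".format(type(json_data)))
    if json_data:  # a batch can be empty
        print("From rdd : {}".format(json_data[0]))
    print("show_data_function_call")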
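# foreachBatch (Spark 2.4+) calls the function with (DataFrame, epoch_id);
# foreach would call it with a single Row instead.
# foreachBatch runs in micro-batch mode, so the trigger is processingTime, not continuous.
jsonDataQuery = dsraw \
    .writeStream \
    .foreachBatch(show_data_fun) \
    .queryName("df_value") \
    .trigger(processingTime='1 second') \
    .start()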
Expected result: the first JSON message in the stream is printed.
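
The "transform the data and create a new column" step described above can be tried out in plain Python before it is wired into the stream. The following is only a minimal sketch under assumptions: each message value is taken to be either a JSON object or a Python-literal dict string, and the function names `parse_message` and `add_total` as well as the fields `price`, `qty` and `total` are hypothetical, not part of the original code.

```python
import json
from ast import literal_eval


def parse_message(value):
    """Turn one message string into a dict.

    Tries JSON first, then falls back to a Python literal
    (for example a dict written with single-quoted keys).
    """
    try:
        return json.loads(value)
    except ValueError:
        return literal_eval(value)


def add_total(record):
    """Return a copy of the record with a hypothetical derived field 'total'."""
    enriched = dict(record)
    enriched["total"] = enriched.get("price", 0) * enriched.get("qty", 0)
    return enriched


# Two sample messages (made up): one valid JSON, one Python-literal style
messages = ['{"price": 2.5, "qty": 4}', "{'price': 1, 'qty': 3}"]
rows = [add_total(parse_message(m)) for m in messages]
print(rows[0])
```

Inside the per-batch function, the same two helpers could presumably be applied with `.rdd.map(lambda row: add_total(parse_message(row['value'])))` on the batch DataFrame, so that the parsing logic stays testable without a running Kafka source.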