9

I'm working on an implementation of Spark Streaming in Scala where I pull JSON strings from a Kafka topic and want to load them into a DataFrame. Is there a way to do this where Spark infers the schema on its own from an RDD[String]?

zero323
masmithd

4 Answers

3

Yes, you can use the following:

sqlContext.read
  // .schema(schema) // optional; makes it a bit faster. If you've processed the data before, you can get the schema using df.schema
  .json(jsonRDD)     // jsonRDD: RDD[String]

I'm trying to do the same at the moment. I'm curious how you got the RDD[String] out of Kafka, though; I'm still under the impression Spark+Kafka only does streaming rather than a one-off "take what's in there right now" batch. :)
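For completeness, a minimal self-contained sketch of the schema inference (assuming Spark 1.4+ and that sc is an existing SparkContext; the sample records are made up):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// an RDD[String] where each element is one JSON document
val jsonRDD = sc.parallelize(Seq(
  """{"id": 1, "name": "alice"}""",
  """{"id": 2, "name": "bob"}"""
))

// Spark samples the RDD and infers the schema from the JSON keys and values
val df = sqlContext.read.json(jsonRDD)
df.printSchema()
df.show()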

Kiara Grouwstra
2

In Spark 1.4, you can try the following to generate a DataFrame from an RDD:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val yourDataFrame = hiveContext.createDataFrame(yourRDD)
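Note that createDataFrame infers the schema by reflection when the RDD contains case class (Product) instances rather than raw JSON strings; for JSON strings, sqlContext.read.json above does the inference. A hypothetical sketch (the Message case class and sample data are made up):

import org.apache.spark.sql.hive.HiveContext

case class Message(id: Long, body: String)

val hiveContext = new HiveContext(sc) // sc: an existing SparkContext
val yourRDD = sc.parallelize(Seq(Message(1L, "hello"), Message(2L, "world")))
val yourDataFrame = hiveContext.createDataFrame(yourRDD) // schema derived from Message's fields
yourDataFrame.show()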
sparklearner
This is similar to the following question: http://stackoverflow.com/questions/29383578/how-to-convert-rdd-object-to-dataframe-in-spark – sparklearner Jun 26 '15 at 17:57
1

You can use the code below to read the stream of messages from Kafka, extract the JSON values, and convert them to a DataFrame (see the setup sketch after the snippet for the ssc, kafkaParams, topicsSet and sqlContext it assumes):

val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

messages.foreachRDD { rdd =>
  // extracting the values only (the direct stream yields (key, value) tuples)
  val df = sqlContext.read.json(rdd.map(x => x._2))
  df.show()
}
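The snippet above assumes ssc, kafkaParams, topicsSet and sqlContext already exist. A minimal setup sketch for Spark 1.x with the spark-streaming-kafka direct API (broker address, topic name and batch interval are placeholders) might look like:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("kafka-json-to-df")
val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches
val sqlContext = new SQLContext(ssc.sparkContext)

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topicsSet = Set("my-topic")

// after wiring up messages.foreachRDD { ... }:
ssc.start()
ssc.awaitTermination()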
radek1st
0

There is no schema inference on the stream itself. You can always read a sample file and pull the schema from it; you could commit that file to version control or keep it in an S3 bucket.
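A hedged sketch of that idea, reusing the messages DStream from the previous answer (the sample-file path is hypothetical):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // sc: an existing SparkContext

// infer the schema once from a sample file kept in version control or S3
val sampleSchema = sqlContext.read.json("s3://my-bucket/sample-events.json").schema

messages.foreachRDD { rdd =>
  // applying the pre-computed schema avoids re-inferring it for every micro-batch
  val df = sqlContext.read.schema(sampleSchema).json(rdd.map(_._2))
  df.show()
}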

Brian