JavaSparkContext sc = new JavaSparkContext(conf);  // conf is an existing SparkConf
SQLContext sqlContext = new SQLContext(sc);
JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(1000));  // 1-second batch interval

My HDFS directory contains JSON files. How can I read them as a stream and convert each batch into a DataFrame?


1 Answer


You can use textFileStream to read the files as a text stream and convert them to DataFrames later.

val dstream = ssc.textFileStream("path to hdfs directory")

This gives you a DStream[String], which is a sequence of RDD[String]s, one RDD per batch interval.

Then you can process the RDD for each batch interval:

dstream.foreachRDD { rdd =>
  // convert this batch's RDD[String] of JSON lines into a DataFrame
  val df = spark.read.json(rdd)  // spark is an existing SparkSession
  // apply transformations or queries to df here
}

ssc.start()              // Start the computation
ssc.awaitTermination()   // Wait for the computation to terminate
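
Putting it all together, here is a minimal self-contained sketch, assuming Spark 2.x with a SparkSession; the app name and HDFS path are placeholders, and spark.read.json(rdd) is the RDD[String] overload (deprecated in newer Spark versions in favor of Dataset[String]):

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object JsonDirStream {
  def main(args: Array[String]): Unit = {
    // placeholder app name; reuse your own SparkSession if you already have one
    val spark = SparkSession.builder().appName("hdfs-json-stream").getOrCreate()

    // 1-second batches, matching new Duration(1000) in the question
    val ssc = new StreamingContext(spark.sparkContext, Seconds(1))

    // placeholder path; point this at your HDFS directory
    val dstream = ssc.textFileStream("hdfs:///path/to/json/dir")

    dstream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // each element is one line of JSON text
        val df = spark.read.json(rdd)
        df.show()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that textFileStream only picks up files created in, or atomically moved into, the directory after the streaming context starts; files already present are ignored.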

Hope this helps.
