I am new to Spark; looks awesome!
I have gobs of hourly logfiles from different sources, and I'd like to create DStreams from them with a sliding window of ~5 minutes so I can explore correlations across them.
I'm just wondering what the best approach would be. Should I chop the files up into 5-minute chunks in separate directories? If so, how would that naming structure be associated with a particular timeslice across different HDFS directories? Or should I implement a filter() that parses each log record's embedded timestamp?
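To make the question concrete, here's roughly what I was picturing. This is just a sketch, not working code I've run: the HDFS path, app name, batch interval, and the parseTimestamp helper (which assumes each line starts with an epoch-millis timestamp followed by a space) are all placeholders I made up for illustration.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object LogCorrelation {

  // Hypothetical helper: pull an epoch-millis timestamp out of a record.
  // Assumes the line starts with the timestamp followed by a space;
  // real parsing would depend on the actual log format.
  def parseTimestamp(line: String): Long =
    line.takeWhile(_ != ' ').toLong

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LogCorrelation")
    // 60-second batch interval is a guess.
    val ssc = new StreamingContext(conf, Seconds(60))

    // textFileStream only picks up files newly placed in the directory,
    // which is part of my question -- my logs already exist as hourly files.
    val lines = ssc.textFileStream("hdfs:///logs/incoming")

    // Option A: a 5-minute window sliding every minute.
    val windowed = lines.window(Minutes(5), Minutes(1))
    windowed.foreachRDD { rdd =>
      println(s"records in current 5-minute window: ${rdd.count()}")
    }

    // Option B: filter on each record's embedded timestamp instead,
    // keeping anything stamped within the last five minutes.
    val recent = lines.filter { line =>
      parseTimestamp(line) >= System.currentTimeMillis() - 5 * 60 * 1000
    }
    recent.foreachRDD(rdd => println(s"recent records: ${rdd.count()}"))

    ssc.start()
    ssc.awaitTermination()
  }
}

I'm not sure which of those two routes (directory layout + window() vs. an explicit filter() on the embedded timestamp) is the more idiomatic one, or whether there's a better way entirely.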
Suggestions, RTFMs welcomed.

Thanks!
Chris