
I am new to Spark; looks awesome!

I have gobs of hourly logfiles from different sources, and I want to create DStreams from them with a sliding window of ~5 minutes to explore correlations.

I'm just wondering what the best approach to accomplish this might be. Should I chop the files up into 5-minute chunks in different directories? How would that naming structure be associated with a particular timeslice across different HDFS directories? Or do I implement a filter() that parses each log record's embedded timestamp?

Suggestions and RTFM pointers welcome.

Thanks! Chris

1 Answer


You can use Apache Kafka as the DStream source and then try the reduceByKeyAndWindow DStream function. It will create a window of whatever duration you need.

See also: Trying to understand spark streaming windowing
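As a minimal sketch of that approach, assuming your hourly logs are already being published to a Kafka topic (the broker address, topic name, checkpoint path, and the idea that the first field of each log line identifies its source are all placeholders to adapt to your setup):

import kafka.serializer.StringDecoder

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object LogWindowing {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LogWindowing")
    // Batch interval of 10 seconds; the window below slides at this rate
    val ssc = new StreamingContext(conf, Seconds(10))
    // Checkpointing is required for the inverse-function form of reduceByKeyAndWindow
    ssc.checkpoint("hdfs:///tmp/log-windowing-checkpoint")

    // Hypothetical broker address and topic name -- replace with your own
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val topics = Set("hourly-logs")

    val lines = KafkaUtils
      .createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
      .map(_._2) // keep the message value, drop the Kafka key

    // Hypothetical record format: assume the first whitespace-delimited
    // field of each log line identifies its source
    val countsBySource = lines
      .map(line => (line.split("\\s+")(0), 1L))
      // Maintain per-source counts over a 5-minute window, sliding every 10s
      .reduceByKeyAndWindow(_ + _, _ - _, Minutes(5), Seconds(10))

    countsBySource.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

The inverse function (_ - _) lets Spark subtract the batches that fall out of the window instead of recomputing the full five minutes on every slide, which is why checkpointing has to be enabled.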

Kaushal