I am new to Spark; looks awesome!
I have gobs of hourly logfiles from different sources, and I'd like to create DStreams from them with a sliding window of ~5 minutes so I can explore correlations across them.
I'm just wondering what the best approach would be. Should I chop the files up into 5-minute chunks in separate directories? If so, how would that naming structure be associated with a particular timeslice across different HDFS directories? Or should I implement a filter() that parses each log record's embedded timestamp?
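To make the question concrete, here's roughly what I was picturing. This is just a sketch, not working code I've run: the HDFS path, app name, batch interval, and the parseTimestamp helper (which assumes each line starts with an epoch-millis timestamp followed by a space) are all placeholders I made up for illustration.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object LogCorrelation {

  // Hypothetical helper: pull an epoch-millis timestamp out of a record.
  // Assumes the line starts with the timestamp followed by a space;
  // real parsing would depend on the actual log format.
  def parseTimestamp(line: String): Long =
    line.takeWhile(_ != ' ').toLong

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LogCorrelation")
    // 60-second batch interval is a guess.
    val ssc = new StreamingContext(conf, Seconds(60))

    // textFileStream only picks up files newly placed in the directory,
    // which is part of my question -- my logs already exist as hourly files.
    val lines = ssc.textFileStream("hdfs:///logs/incoming")

    // Option A: a 5-minute window sliding every minute.
    val windowed = lines.window(Minutes(5), Minutes(1))
    windowed.foreachRDD { rdd =>
      println(s"records in current 5-minute window: ${rdd.count()}")
    }

    // Option B: filter on each record's embedded timestamp instead,
    // keeping anything stamped within the last five minutes.
    val recent = lines.filter { line =>
      parseTimestamp(line) >= System.currentTimeMillis() - 5 * 60 * 1000
    }
    recent.foreachRDD(rdd => println(s"recent records: ${rdd.count()}"))

    ssc.start()
    ssc.awaitTermination()
  }
}

I'm not sure which of those two routes (directory layout + window() vs. an explicit filter() on the embedded timestamp) is the more idiomatic one, or whether there's a better way entirely.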
Suggestions, RTFMs welcomed.

Thanks!
Chris