
I'm trying to use Apache Spark Streaming. I have one data source: a CSV file on HDFS.

I'm planning to do the following with Spark Streaming:

  1. Read the CSV file periodically (every 5 minutes) with textFileStream.
  2. Split the DStream into multiple sub-DStreams.

Below is a simple example of the requirement.

We get a CSV file in this format:

NAME, SCHOOL, GENDER, AGE, SUBJECT, SCORE
USR1, SCH001, male  , 28 , MATH   , 100  
USR2, SCH002, male  , 20 , MATH   , 99
USR1, SCH001, male  , 28 , ENGLISH, 80
USR8, SCHOO8, female, 20 , PHY    , 100

Every 5 minutes I read a file like this, and I want to split the input DStream into several sub-DStreams, one stream per user. Is that possible?

Kramer Li
  • Based on what would you want to split them? Although, as with RDDs, I don't think it's possible. – Mateusz Dymczyk Mar 14 '16 at 08:19
  • @MateuszDymczyk Multiple filters should be enough, don't you think? – zero323 Mar 14 '16 at 08:22
  • @zero323 Yeah, sorry for not being precise there. Multiple filters should do the trick; doing it in one pass in parallel is not supported, though, right? – Mateusz Dymczyk Mar 14 '16 at 08:24
  • @MateuszDymczyk Like you said, RDDs don't support this, and a DStream is just a sequence of RDDs. It is possible to do something like [this](http://stackoverflow.com/a/32817565/1560062), but that only works under the assumption that the data fits into memory. There is also repartitioning and filtering partitions, but that means a full shuffle. – zero323 Mar 14 '16 at 08:29
  • Hi guys. I updated my requirement. Could you take a look, please? – Kramer Li Mar 14 '16 at 08:43

1 Answer


In my opinion, if you collect your data at a fixed interval, you don't need streaming features. Streaming is useful when you don't know when your data will arrive. But if your job needs a (near) real-time computation, for example the cumulative score per user over the day or hour, then streaming is your solution. The question is: do you want a photo of one file, or a film across multiple files?
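To make the photo/film distinction concrete, here is a minimal plain-Python sketch (no Spark involved). `snapshot_totals` treats one file in isolation, while `running_totals` carries per-user state from file to file, which is roughly the job a stateful DStream operation such as `updateStateByKey` would do. All function names are made up for illustration, and `FILE_2` is invented sample data, not something from the question.

```python
import csv
from collections import Counter
from io import StringIO


def parse_scores(csv_text):
    """Yield (user, score) pairs from one CSV file's text, skipping the header."""
    reader = csv.reader(StringIO(csv_text), skipinitialspace=True)
    next(reader, None)  # drop the header row
    for row in reader:
        if len(row) < 6:  # ignore blank or malformed lines
            continue
        yield row[0].strip(), int(row[5].strip())


def snapshot_totals(csv_text):
    """The "photo": total score per user within a single file."""
    totals = Counter()
    for user, score in parse_scores(csv_text):
        totals[user] += score
    return totals


def running_totals(files):
    """The "film": cumulative score per user after each successive file."""
    state = Counter()
    for csv_text in files:
        state.update(snapshot_totals(csv_text))  # Counter.update adds counts
        yield dict(state)


# The sample file from the question.
FILE_1 = """NAME, SCHOOL, GENDER, AGE, SUBJECT, SCORE
USR1, SCH001, male  , 28 , MATH   , 100
USR2, SCH002, male  , 20 , MATH   , 99
USR1, SCH001, male  , 28 , ENGLISH, 80
USR8, SCHOO8, female, 20 , PHY    , 100
"""

# Hypothetical second file arriving 5 minutes later (invented for illustration).
FILE_2 = """NAME, SCHOOL, GENDER, AGE, SUBJECT, SCORE
USR2, SCH002, male  , 20 , PHY    , 70
"""

if __name__ == "__main__":
    print(dict(snapshot_totals(FILE_1)))
    for totals in running_totals([FILE_1, FILE_2]):
        print(totals)
```

If only the first function is needed, a scheduled batch job every 5 minutes is probably enough; the second one is where streaming state starts to pay off.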

Grouping by USR differs between the two use cases, and it is more complicated with streaming: you have to consider what type of computation you run over each group and the window and slide parameters. I suggest taking a look at this
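As a rough illustration of grouping by user over a window, here is a plain-Python model that keeps only the last few batches and recomputes per-user totals each time a batch arrives. In Spark Streaming the closest counterpart would be a keyed window operation such as `reduceByKeyAndWindow`. The code below only imitates that idea and is not Spark code. The function names and the second and third batches are assumptions made up for this sketch.

```python
from collections import defaultdict, deque


def split_by_user(batch):
    """Group one batch of (user, score) pairs into per-user score lists."""
    groups = defaultdict(list)
    for user, score in batch:
        groups[user].append(score)
    return dict(groups)


def windowed_totals(batches, window_length=2):
    """Yield per-user totals over the most recent `window_length` batches."""
    window = deque(maxlen=window_length)  # old batches fall out automatically
    for batch in batches:
        window.append(split_by_user(batch))
        totals = defaultdict(int)
        for groups in window:
            for user, scores in groups.items():
                totals[user] += sum(scores)
        yield dict(totals)


BATCHES = [
    # Rows from the question's sample file, reduced to (NAME, SCORE).
    [("USR1", 100), ("USR2", 99), ("USR1", 80), ("USR8", 100)],
    [("USR2", 70)],  # hypothetical later batch
    [("USR1", 50)],  # hypothetical later batch
]

if __name__ == "__main__":
    for totals in windowed_totals(BATCHES, window_length=2):
        print(totals)
```

Note that this keeps one keyed collection rather than one stream per user. With an unknown or large number of users that tends to be easier to manage than a separate filter per user.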

CarloV