I'm trying to use Apache Spark Streaming. I have one data source, a CSV file on HDFS.
I'm planning to do the following with Spark Streaming:
- Read the CSV file periodically (every 5 minutes) with textFileStream.
- Split the DStream into multiple sub-DStreams.
Below is a simple example of the requirement.
We have a CSV file in this format:
NAME, SCHOOL, GENDER, AGE, SUBJECT, SCORE
USR1, SCH001, male , 28 , MATH , 100
USR2, SCH002, male , 20 , MATH , 99
USR1, SCH001, male , 28 , ENGLISH, 80
USR8, SCHOO8, female, 20 , PHY , 100
Every 5 minutes I read a file like this, and then I want to split the input DStream into several sub-DStreams, with one stream per user. Is that possible?
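
To make the requirement concrete, here is a minimal Python sketch of one possible approach: key every row by NAME and group the rows, rather than creating a separate DStream per user. The helper names (parse_line, key_by_user, group_by_user, build_stream) and the directory argument are hypothetical and only for illustration. The first three functions are plain Python, so the parsing and grouping logic can be checked locally. build_stream assumes PySpark and a StreamingContext created elsewhere with a 300-second batch interval. Since textFileStream watches a directory for newly added files, the sketch also assumes a new CSV file lands in that directory every 5 minutes.

```python
from collections import defaultdict

# Column names taken from the header row of the sample file.
FIELDS = ["NAME", "SCHOOL", "GENDER", "AGE", "SUBJECT", "SCORE"]


def parse_line(line):
    """Split one CSV line into a dict, trimming the padding around each value.

    Returns None for the header row and for lines with the wrong column count.
    """
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELDS) or values == FIELDS:
        return None
    return dict(zip(FIELDS, values))


def key_by_user(line):
    """Turn a raw line into a (NAME, record) pair, or None if it is unusable."""
    record = parse_line(line)
    return None if record is None else (record["NAME"], record)


def group_by_user(lines):
    """Local stand-in for groupByKey, useful for checking the logic without Spark."""
    groups = defaultdict(list)
    for line in lines:
        pair = key_by_user(line)
        if pair is not None:
            groups[pair[0]].append(pair[1])
    return dict(groups)


def build_stream(ssc, directory):
    """Apply the same functions to a DStream (hypothetical wiring, needs PySpark).

    Assumes ssc is a pyspark.streaming.StreamingContext with a 300 s batch interval.
    """
    lines = ssc.textFileStream(directory)
    pairs = lines.map(key_by_user).filter(lambda p: p is not None)
    # One (NAME, records) entry per user in each batch.
    return pairs.groupByKey()
```

With this shape, each batch yields one group per user, and per-user handling can happen downstream (for example inside foreachRDD). If the set of users were fixed and known before the streaming context starts, applying one filter per user to the same input DStream would be an alternative.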