
I am quite new to Spark Streaming, and I am stuck saving my output.

My question is: how can I save the output of my JavaPairDStream to a text file that is updated on each batch with only the elements currently in the DStream?

For example, with the wordCount example,

JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
      new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String s) {
          return new Tuple2<>(s, 1);
        }
      }).reduceByKey(new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer i1, Integer i2) {
          return i1 + i2;
        }
      });

Using wordCounts.print(), I get the following output:

(Hello,1)
(World,1)

I would like to write these lines to a text file that is refreshed each batch with the current contents of wordCounts.

I've tried the following approach:

wordCounts.dstream().saveAsTextFiles("output", "txt");

This generates a new directory every batch interval, each containing several part files rather than a single readable text file.

Another approach would be:

wordCounts.foreachRDD(new Function2<JavaPairRDD<String, Integer>, Time, Void>() {
    @Override
    public Void call(JavaPairRDD<String, Integer> rdd, Time time) {
        // Something over rdd to save its contents to a file???

        return null;
    }
});

I would appreciate some help.

Thank you


1 Answer


You can do it like below. See also the related SO post about saveAsTextFile producing multiple output files.

wordCounts.foreachRDD(rdd -> {
    if (!rdd.isEmpty()) {
        rdd.coalesce(1).saveAsTextFile("c:\\temp\\count\\");
    }
});
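If you want a single file that is overwritten each batch rather than a new directory per interval, another option is to collect the (small) per-batch result to the driver and write it with plain Java I/O. Below is a minimal sketch of that overwrite pattern with the Spark parts stubbed out: the `writeBatch` helper, the `batch` map, and the output path are illustrative assumptions, not part of any Spark API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class OverwriteCounts {

    // In the streaming job you would build this map inside foreachRDD,
    // e.g. via rdd.collectAsMap(); here we fake one batch's word counts.
    static void writeBatch(Map<String, Integer> batch, Path out) throws IOException {
        String lines = batch.entrySet().stream()
                .map(e -> "(" + e.getKey() + "," + e.getValue() + ")")
                .collect(Collectors.joining(System.lineSeparator()));
        // Files.write with the default open options truncates the file first,
        // so each batch replaces the previous contents instead of appending.
        Files.write(out, lines.getBytes());
    }

    public static void main(String[] args) throws IOException {
        Path out = Paths.get("wordcounts.txt"); // hypothetical output path
        Map<String, Integer> batch = new LinkedHashMap<>();
        batch.put("Hello", 1);
        batch.put("World", 1);
        writeBatch(batch, out);
        System.out.println(new String(Files.readAllBytes(out)));
    }
}
```

In the streaming job itself you would call something like `wordCounts.foreachRDD(rdd -> writeBatch(rdd.collectAsMap(), out));`. Note this is only sensible when the per-batch result is small enough to fit in driver memory.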