
I use Spark Streaming to receive tweets from Twitter. I get many warnings that say:

replicated to only 0 peer(s) instead of 1 peers

What is this warning for?

My code is:

    // Required imports (Spark 1.x streaming, spark-streaming-twitter, twitter4j):
    //   org.apache.spark.SparkConf
    //   org.apache.spark.streaming.Durations
    //   org.apache.spark.streaming.api.java.JavaPairDStream
    //   org.apache.spark.streaming.api.java.JavaReceiverInputDStream
    //   org.apache.spark.streaming.api.java.JavaStreamingContext
    //   org.apache.spark.streaming.twitter.TwitterUtils
    //   twitter4j.auth.AuthorizationFactory
    //   twitter4j.conf.ConfigurationBuilder

    SparkConf conf = new SparkConf().setAppName("Test");
    JavaStreamingContext sc = new JavaStreamingContext(conf, Durations.seconds(5));
    sc.checkpoint("/home/arman/Desktop/checkpoint"); // required by updateStateByKey

    // OAuth credentials for the Twitter API.
    ConfigurationBuilder cb = new ConfigurationBuilder();
    cb.setOAuthConsumerKey("****************")
        .setOAuthConsumerSecret("**************")
        .setOAuthAccessToken("*********************")
        .setOAuthAccessTokenSecret("***************");

    JavaReceiverInputDStream<twitter4j.Status> statuses = TwitterUtils.createStream(sc,
            AuthorizationFactory.getInstance(cb.build()));

    // Extract hashtags, keep a running count per hashtag, and save each batch.
    JavaPairDStream<String, Long> hashtags = statuses.flatMapToPair(new GetHashtags());
    JavaPairDStream<String, Long> hashtagsCount = hashtags.updateStateByKey(new UpdateReduce());
    hashtagsCount.foreachRDD(new saveText(args[0], true));

    sc.start();
    sc.awaitTerminationOrTimeout(Long.parseLong(args[1]));
    sc.stop();
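
For reference, `GetHashtags` and `UpdateReduce` are my own helper classes. Roughly, they look like this (a sketch only, assuming the Spark 1.x Java API, where `updateStateByKey` uses Guava's `Optional`):

    // Sketch of the user-defined helpers; details may differ from the real code.
    // Imports: org.apache.spark.api.java.function.PairFlatMapFunction,
    //          org.apache.spark.api.java.function.Function2,
    //          com.google.common.base.Optional, scala.Tuple2,
    //          java.util.ArrayList, java.util.List
    class GetHashtags implements PairFlatMapFunction<twitter4j.Status, String, Long> {
        @Override
        public Iterable<Tuple2<String, Long>> call(twitter4j.Status status) {
            // Emit one ("#tag", 1L) pair per hashtag entity in the tweet.
            List<Tuple2<String, Long>> pairs = new ArrayList<Tuple2<String, Long>>();
            for (twitter4j.HashtagEntity tag : status.getHashtagEntities()) {
                pairs.add(new Tuple2<String, Long>("#" + tag.getText(), 1L));
            }
            return pairs;
        }
    }

    class UpdateReduce implements Function2<List<Long>, Optional<Long>, Optional<Long>> {
        @Override
        public Optional<Long> call(List<Long> newCounts, Optional<Long> state) {
            // Add this batch's counts for a key to its running total.
            long sum = state.or(0L);
            for (Long c : newCounts) {
                sum += c;
            }
            return Optional.of(sum);
        }
    }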
Arman

1 Answer


When reading data with Spark Streaming, incoming data blocks are replicated to at least one other node/worker for fault tolerance. Without replication, if the receiver reads a piece of data from the stream and then fails, that piece is lost: it has already been consumed (and removed) from the stream, and the only in-memory copy disappears with the failed worker.

Referring to the Spark documentation:

While a Spark Streaming driver program is running, the system receives data from various sources and divides it into batches. Each batch of data is treated as an RDD, that is, an immutable parallel collection of data. These input RDDs are saved in memory and replicated to two nodes for fault-tolerance.

The warning in your case means that incoming data from the stream is not being replicated at all. The reason may be that you are running the app with just one instance of a Spark worker, or running in local mode. Try starting more Spark workers and see if the warning goes away.
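
Alternatively, if you intentionally run with a single worker (for example while testing locally), you can ask the receiver to store blocks with only one replica, so Spark does not attempt replication at all. A sketch, assuming the `TwitterUtils.createStream` overload that accepts a storage level; note that you give up the fault tolerance described above:

    import org.apache.spark.storage.StorageLevel;

    // MEMORY_AND_DISK_SER keeps a single replica instead of the receiver
    // default MEMORY_AND_DISK_SER_2 (two replicas), so a lone worker has no
    // peer to replicate to and the warning goes away. A receiver failure can
    // then lose blocks that were not yet processed.
    JavaReceiverInputDStream<twitter4j.Status> statuses = TwitterUtils.createStream(
            sc,
            AuthorizationFactory.getInstance(cb.build()),
            new String[0],                       // no keyword filters
            StorageLevel.MEMORY_AND_DISK_SER());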

vanekjar
  • Is there a way to prevent these WARN messages from being printed to the console? – Saqib Ali Sep 06 '16 at 03:37
  • To mute them, change `log4j.rootCategory=WARN, console` to `log4j.rootCategory=ERROR, console` in the `log4j.properties` file. – Saqib Ali Sep 09 '16 at 03:36
  • @SaqibAli That affects a lot of messages that we may not wish to suppress. A more targeted solution is `log4j.loggr.org.apache.spark.storage=ERROR`. There may be other *sub* packages under Spark that should be muted - but preferably not *all* of them. – WestCoastProjects Aug 07 '18 at 00:09
  • @javadba That should probably be `log4j.logger.org.apache.spark.storage=ERROR`, i.e. 'logger' instead of 'loggr'. – abel Oct 29 '18 at 10:31
  • True - I had typed it by hand instead of copying and pasting. I cannot fix the comment directly, so here is the correct path: `log4j.logger.org.apache.spark.storage=ERROR` – WestCoastProjects Oct 29 '18 at 14:37
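
Putting the comment thread together, a minimal `log4j.properties` fragment that keeps the root level at WARN but silences only the storage subsystem (the source of this particular warning):

    # Keep the default root level for everything else...
    log4j.rootCategory=WARN, console
    # ...but drop the block-replication warnings from the storage subsystem.
    log4j.logger.org.apache.spark.storage=ERROR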