
We are building a system consisting of multiple Spark Streaming applications, with each application having multiple receivers. As far as I understand, each receiver needs its own core in the cluster. We need multiple receivers to accommodate peaks, but we don't need them all the time. The applications are kept quite small, each doing only one task, so that they can be (re)submitted to the cluster without disturbing the other jobs and tasks.

1) Assuming we have 5 jobs with 5 receivers each, we would need at least 25 cores in the cluster just for the receivers to run, plus the cores for the processing. Is this right?

2) Is there a possibility of more dynamic resource allocation, or is one core strictly bound to one receiver?

3) I took a look at the spark-rest-server, which offers the possibility of sharing a Spark context across different jobs. Could you imagine having one SparkStreamingContext for all (~100) jobs?
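For reference, what I have in mind would look roughly like the following. This is only a minimal sketch; the socket sources, host/port values, and batch interval are placeholders, not our actual setup:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch: one StreamingContext hosting several independent streams.
// The socket sources, host/port values, and batch interval are placeholders.
object SharedContextSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("shared-streaming-context")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Each "job" becomes one input stream plus its own output action,
    // all registered on the same context. Note that each socket stream
    // still creates a receiver that occupies a core.
    val jobs = Seq(("host-a", 9999), ("host-b", 9999))
    jobs.foreach { case (host, port) =>
      val lines = ssc.socketTextStream(host, port)
      lines.count().print()  // placeholder for the real per-job processing
    }

    ssc.start()              // starts all registered streams at once
    ssc.awaitTermination()
  }
}
```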

We are running the cluster in standalone mode together with a Cassandra cluster on the same nodes.

mniehoff

1 Answer

  1. If you run 5 distinct Spark applications, each having 5 receivers, then yes, data ingestion alone will consume 5x5=25 cores. However, have you looked at receiver-less approaches? (See § 2 of https://spark.apache.org/docs/latest/streaming-kafka-integration.html; a minimal sketch follows this list.)
  2. Spark has dynamic allocation capabilities on YARN and on Mesos, but this concerns executors, not receivers.
  3. Pipelining data within a smaller number of applications seems to make sense: if you have ~100 applications that each do simple ETL, it's probable that starting and scheduling those applications takes more time than the crunching they actually do. I could be wrong on this, but then you'd have to be more specific about what they do (perhaps in another SO question, after you've benchmarked a bit?)
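To illustrate the receiver-less option from point 1, here is a minimal sketch of the direct Kafka stream from the linked documentation. The broker address, topic name, and batch interval are placeholders, and as noted in the comments below, this only applies if you are actually ingesting from Kafka:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch of the receiver-less ("direct") Kafka integration: no long-running
// receiver occupies a core; each batch reads its Kafka offset range directly.
object DirectKafkaSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("direct-kafka-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // placeholder broker
    val topics = Set("events")                                       // placeholder topic

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.map(_._2).count().print()  // placeholder for the actual processing

    ssc.start()
    ssc.awaitTermination()
  }
}
```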
Francois G
  • Thanks for the response. 1. The direct approach would be an option if we were using Kafka ;-) 2. I learnt (after asking here) that dynamic allocation is not yet supported for Spark Streaming; at least not automatic scaling, only by implementing it yourself. 3. We are now using multiple threads in one Spark application. That looks OK so far, but we still have a few issues with the overhead of starting tasks. This needs some optimization. – mniehoff Nov 23 '15 at 15:43