Does it make sense to run Spark job for its side effects?

Question

I want to run a Spark job, where each RDD is responsible for sending certain traffic over a network connection. The return value from each RDD is not very important, but I could perhaps ask them to return the number of messages sent. The important part is the network traffic, which is basically a side effect for running a function over each RDD.

Is it a good idea to perform the above task in Spark?

I'm trying to simulate network traffic from multiple sources to test the data collection infrastructure on the receiving end. I could instead manually setup multiple machines to run the sender, but I thought it'd be nice if I could take advantage of Spark's existing distributed framework.

However, it seems like Spark is designed for programs to "compute" and then "return" something, not for programs to run for their side effects. I'm not sure if this is a good idea, and would appreciate input from others.

To be clear, I'm thinking of something like the following

IDs = sc.parallelize(range(0, n))

def f(x):
    for i in range(0,100):
        message = make_message(x, i)
        SEND_OVER_NETWORK(message)
    return (x, 100)

IDsOne = IDs.map(f)
counts = IDsOne.reduceByKey(add)

for (ID, count) in counts.collect():
    print ("%i ran %i times" % (ID, count))

Although I understand every word in the first paragraph, and even the sentences are pretty clear, I don't understand why you're trying to do this. The context may help make this clearer. So, why are you doing this? What's the end goal? — Software Engineer, Oct 28 '15 at 16:51

zero323 · Accepted Answer · 2015-11-07T04:59:27.047

Generally speaking it doesn't make sense:

Spark is a heavyweight framework. At its core there is this huge machinery which ensures that data is properly distributed, collected, recovery is possible and so on. It has a significant impact on overall performance and latency but doesn't provide any benefits in case of side-effects-only tasks
Spark concurrency has a relatively low granularity with partition being the main unit of concurrency. At this level processing becomes synchronous. You cannot move on to the next partition before you finish the current one.

Lets say in your case there is a single slow SEND_OVER_NETWORK. If you use map you pretty much block processing on a whole partition. You can go at the lower level with mapPartitions, make SEND_OVER_NETWORK asynchronous, and return only when a whole partition has been processed. It is better but still suboptimal.

You can increase number of partitions, but it means higher bookkeeping overhead so at the end of the day you can make situation worse not better.
Spark API is designed mostly for side effects free operations. It makes it hard to express operations which doesn't fit into this model.

What is arguably more important is that Spark guarantees only that each operation is executed at-least-once (lets ignore zero-times if rdd is never materialized). If application requires for example exactly-once semantics things become tricky especially when you consider point 2.

It is possible to keep track of local state for each partition outside the main Spark logic but if you get there it is a really good sign that Spark is not the right tool.

Does it make sense to run Spark job for its side effects?

1 Answers1

Linked