I want to run a Spark job, where each RDD is responsible for sending certain traffic over a network connection. The return value from each RDD is not very important, but I could perhaps ask them to return the number of messages sent. The important part is the network traffic, which is basically a side effect for running a function over each RDD.
Is it a good idea to perform the above task in Spark?
I'm trying to simulate network traffic from multiple sources to test the data collection infrastructure on the receiving end. I could instead manually setup multiple machines to run the sender, but I thought it'd be nice if I could take advantage of Spark's existing distributed framework.
However, it seems like Spark is designed for programs to "compute" and then "return" something, not for programs to run for their side effects. I'm not sure if this is a good idea, and would appreciate input from others.
To be clear, I'm thinking of something like the following
IDs = sc.parallelize(range(0, n))
def f(x):
for i in range(0,100):
message = make_message(x, i)
SEND_OVER_NETWORK(message)
return (x, 100)
IDsOne = IDs.map(f)
counts = IDsOne.reduceByKey(add)
for (ID, count) in counts.collect():
print ("%i ran %i times" % (ID, count))