
I would like to understand if the following would be a correct use case for Spark.

Requests to an application are received either on a message queue or in a file containing a batch of requests. On the message queue there are currently about 100 requests per second, although this could increase. Some files contain just a few requests, but more often there are hundreds or even many thousands.

Processing for each request includes filtering, validation, reference data lookup, and calculations. Some calculations reference a rules engine. Once these are complete, a new message is sent to a downstream system.

We would like to use Spark to distribute the processing across multiple nodes to gain scalability, resilience and performance.

I am envisaging that it would work like this:

  1. Load a batch of requests into Spark as an RDD (requests received on the message queue might instead use Spark Streaming).
  2. Separate Scala functions would be written for filtering, validation, reference data lookup and data calculation.
  3. The first function would be applied to the RDD via a transformation such as map or filter, returning a new RDD.
  4. The next function would then be run against the RDD output by the previous function.
  5. Once all the functions have been applied, a foreach (or for comprehension) would be run against the final RDD to send each modified request to the downstream system (a rough sketch of this flow follows below).
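For illustration, here is a minimal Scala sketch of that batch flow, assuming a hypothetical Request case class, stub versions of the four processing functions, and a made-up input path; it is only meant to show how the steps above chain together as RDD transformations, not a definitive implementation:

    import org.apache.spark.{SparkConf, SparkContext}

    object RequestBatchPipeline {

      // Hypothetical request type; the real fields would come from your domain.
      case class Request(id: String, payload: String, valid: Boolean = true)

      // Stub versions of the four functions from step 2.
      def passesFilter(r: Request): Boolean = r.payload.nonEmpty
      def validate(r: Request): Request = r.copy(valid = r.payload.length < 1024)
      def lookupReferenceData(r: Request): Request = r    // enrichment stub
      def calculate(r: Request): Request = r               // rules-engine stub
      def sendDownstream(r: Request): Unit = println(s"sending ${r.id}")

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("request-batch"))

        // Step 1: load a batch of requests (here from a text file) as an RDD.
        val requests = sc.textFile("hdfs:///requests/batch-001.txt")
          .map(line => Request(line.hashCode.toString, line))

        // Steps 3-4: each function is applied as a transformation
        // and yields a new RDD.
        val processed = requests
          .filter(passesFilter)
          .map(validate)
          .filter(_.valid)
          .map(lookupReferenceData)
          .map(calculate)

        // Step 5: push each processed request to the downstream system;
        // foreachPartition allows one downstream connection per partition.
        processed.foreachPartition(_.foreach(sendDownstream))

        sc.stop()
      }
    }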

Does the above sound correct, or would this not be the right way to use Spark?

Thanks

user1052610
  • I have done something a little bit similar. Mainly, we developed a web scraper, but we used the Scala Akka framework. It is an actor-based framework which provides concurrency and scalability out of the box. There is a MainActor at the top of the system which kicks off the whole process. It creates a MasterActor, which in turn feeds the workers with units of work. The unit of work was to process a single URL, strip out non-English data, and store the result in Cassandra. The output was a Future object, with some callbacks on it. In general, actors send/receive only immutable messages. Your system looks actor-based. – Szymon Roziewski Jan 26 '16 at 13:14
  • Thanks very much, but we are specifically considering Spark and want to know if it would be a good fit given the requirements. Spark gives a higher level of abstraction than Akka. – user1052610 Jan 26 '16 at 13:32
  • I know, but it is also possible to use both of them. – Szymon Roziewski Jan 26 '16 at 13:55

1 Answer


We have done something similar on a small IoT project. We tested receiving and processing around 50K MQTT messages per second on 3 nodes and it was a breeze. Our processing included parsing each JSON message, some manipulation of the resulting object, and saving all the records to a time-series database. We set the batch time to 1 second; the processing time was around 300 ms and RAM usage was in the hundreds of KB.

A few concerns with streaming: make sure your downstream system is asynchronous so you won't run into memory issues. It's true that Spark supports backpressure, but you will need to enable and tune it yourself. Another thing: try to keep state to a minimum. More specifically, you should not keep any state that grows linearly as your input grows. This is extremely important for your system's scalability.
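To make the backpressure and batch-interval points concrete, here is a minimal Spark Streaming (DStream) sketch with a 1-second batch; the socket source, host/port, and rate limit are assumptions for illustration only, not the setup used in this project:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingBackpressureSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("request-stream")
          // Backpressure is off by default; enabling it lets Spark adapt
          // the ingestion rate to the observed processing time.
          .set("spark.streaming.backpressure.enabled", "true")
          // Optional hard ceiling per receiver while backpressure warms up.
          .set("spark.streaming.receiver.maxRate", "10000")

        // 1-second micro-batches, matching the batch time mentioned above.
        val ssc = new StreamingContext(conf, Seconds(1))

        // socketTextStream is a stand-in source; a real job would use an
        // MQTT or message-queue receiver instead.
        val lines = ssc.socketTextStream("localhost", 9999)

        lines.foreachRDD { rdd =>
          // Keep per-batch work stateless so memory does not grow with input.
          rdd.foreachPartition(_.foreach(msg => println(s"processed: $msg")))
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }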

What impressed me the most is how easily you can scale with Spark. With each node we added, the message throughput we could handle grew linearly.

I hope this helps a little. Good luck.

z-star