
Some background, before getting to the real question:

I am working on a back-end application that consists of several different modules. Each module is, currently, a command-line Java application, which is run "on demand" (more details later).

Each module is a "step" in a bigger process that you can think of as a data flow; the first step collects data files from an external source and loads them into some SQL database tables; the following steps, based on different conditions and events (timing, presence of data in the DB, messages and elaborations done through a web service/web interface), take data from one or more DB tables, process it, and write it to different tables. Steps run on three different servers and read data from three different DBs, but write to only a single DB. The purpose is to aggregate data and compute metrics and statistics.

Currently, each module is executed periodically via a cronjob (every few minutes/hours for the first modules, every few days for the last ones in the chain, which need to aggregate more data and therefore wait longer for it to be available). When a module (currently, a Java console application) runs, it checks the database for new, unprocessed information in a given datetime window, and does its job.

The problem: it works, but.. I need to expand and maintain it, and this approach is starting to show its limits.

  1. I do not like relying on "polling"; it is wasteful, considering that the information from previous modules could be sufficient to "tell" the modules down the chain when the information they need is available, so that they can proceed.
  2. It is "slow": the several days of delay for modules down the chain are there because we have to be sure data has arrived and been processed by the previous modules. So we "stop" these modules until we are sure we have all the data. New additions require real-time (not hard real-time, but "as soon as possible") computation of some metrics. A very good example is what happens here, on SO, with badges! :) I need to obtain something really similar.

To solve the second problem, I am going to introduce "partial", or "incremental", computations: as soon as I have a set of relevant information, I process it. Then, when some other linked information arrives, I compute the difference and update the data accordingly, but then I also need to notify other (dependent) modules.
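To make "incremental" concrete, this is roughly the idea (a simplified sketch; the real modules of course work on DB rows and metrics, not an in-memory map):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of an incremental computation: instead of recomputing
// an aggregate from scratch, apply only the delta when new data arrives.
public class IncrementalSum {
    private final Map<String, Long> totals = new HashMap<>();

    // Called when a new piece of data for 'key' arrives: update the
    // aggregate with the difference only, then (in the real system)
    // notify dependent modules that 'key' changed.
    public long add(String key, long delta) {
        long updated = totals.getOrDefault(key, 0L) + delta;
        totals.put(key, updated);
        return updated;
    }

    public long total(String key) {
        return totals.getOrDefault(key, 0L);
    }
}
```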

The question(s)

- 1) What is the best way to do it?
- 2) Related: what is the best way to "notify" other modules (Java executables, in my case) that relevant data is available?

I can see three ways:

  • add other, "non-data" tables to the DB, in which each module writes "Hey, I have done this and it is available". When the cronjob starts another module, it reads the table(s), decides it can compute subset xxx, and does it. And so on
  • use Message Queues, like ZeroMQ, (or Apache Camel, like @mjn suggested) instead of DB tables
  • use a key-value store, like Redis, instead of DB tables

Edit: I am convinced that an approach based on queues is the way to go; I added the "table + polling" option for completeness, but now I understand it is only a distraction (obviously, everyone is going to answer "yes, use queues, polling is evil" - and rightly so!). So let me rephrase the question as: what are the advantages/disadvantages of using an MQ over a key-value store with pub/sub, like Redis?

  • 3) is there any solution that helps me get rid of the cronjobs completely?

Edit: in particular, in my case, it means: is there a mechanism in some MQ and/or key-value store that lets me publish messages with a "time"? Like "deliver this in 1 day"? With persistence and an at-least-once delivery guarantee, obviously

  • 4) should I build this message- (event?-) based solution as a centralized service, running it as a daemon/service on one of the servers?
  • 5) should I abandon the idea of starting the subscribers on demand, and have each module run continuously as a daemon/service?
  • 6) what are the pros and cons (reliability, single point of failure vs. resource usage and complexity...)?

Edit: this is the bit I care about most: I would like the "queue" itself to activate the "modules" based on messages in the queue, similar to MSMQ Activation. Is it a good idea? Is there anything in the Java world that does it? Should I implement it myself (over an MQ or over Redis), or should I run each module as a daemon? (Even though some computations typically happen in bursts, e.g. two hours of processing followed by two days of idling.)
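To illustrate what I mean by "activation", here is a rough sketch. Note the assumptions: I use a java.util.concurrent.BlockingQueue as a stand-in for the real MQ subscription or Redis BLPOP, and the topic names and module jar names are made up:

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of an "activator": a tiny daemon that blocks on a queue and
// launches the right module only when a message for it arrives, instead
// of each module polling on a cron schedule.
public class ModuleActivator {
    // Hypothetical mapping: message "topic" -> module command line.
    static final Map<String, String[]> MODULES = Map.of(
        "raw-data-loaded",  new String[] {"java", "-jar", "aggregator.jar"},
        "aggregates-ready", new String[] {"java", "-jar", "metrics.jar"});

    static String[] commandFor(String topic) {
        return MODULES.get(topic);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the MQ / Redis list that previous modules push to.
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        while (true) {
            String topic = queue.take();          // blocks; no polling
            String[] cmd = commandFor(topic);
            if (cmd != null) {
                new ProcessBuilder(cmd).inheritIO().start(); // spawn module
            }
        }
    }
}
```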

NOTE: I cannot use heavy containers/EJB (No Glassfish or similar)

Edit: Camel as well seems a little too heavy for me. I'm looking for something really light here, both in terms of resources and complexity of development

Lorenzo Dematté
  • In addition to the publisher/subscriber you may include in your evaluation also the actors model: http://akka.io/ – TizianoPiccardi Mar 10 '13 at 21:52
  • looks like a pretty sick application and I know your pain. Probably not a solution for you, but I once managed to avoid those typical end-of-day/end-of-week batch jobs by calculating statistics and metrics on the fly using the event processing engine Esper. You register a query whose result is recalculated each time an event arrives. The domain language is powerful, especially with time-based data, which is usually a pain to handle in SQL or even a procedural language. – MarianP Mar 10 '13 at 22:13
  • @TizianoPiccardi, I will add your solution to the list of candidates. I have used actors in the past (not in java/scala, however) and I like the paradigm. Still, they are (after all) just another technology for event-based/message-driven computations, which is what to adopt. But it is not clear to me which one to choose, and how to apply it in particular. – Lorenzo Dematté Mar 11 '13 at 08:05
  • @MarianP sounds interesting... Have you used it with your existing code, or designed a new solution from 0 using it? Does it play nicely with Java/Scala? – Lorenzo Dematté Mar 11 '13 at 08:07
  • I put it in right in the beginning from 0. But you might find migration easy, as I said the domain language is powerful (takes a bit learning and experimenting at first though). Esper is written in Java (there is also an equivalent .Net version), with proper api and unit testing support. The only problem I had was that I had to store all events in a db and replay them at start of a JVM in case this was restarted. However, there is also a paid version of the engine which should have support for this persistent stuff. – MarianP Mar 11 '13 at 11:52
  • btw, it is super fast, I think I read that it is able to process 100k events with 500 registered queries in 1 second! You may not use Esper, there are other similar (paid) engines, but the online processing of events is a good paradigm IMO. – MarianP Mar 11 '13 at 11:55
  • @dema80 I haven't read all of your question, but with an educated guess I would suggest this link: http://stackoverflow.com/questions/2635272/fastest-low-latency-method-for-inter-process-communication-between-java-and-c – mostruash Mar 15 '13 at 06:02

3 Answers


The queue task descriptions partially sound like things that systems based on "enterprise integration patterns", such as Apache Camel, do.

A delayed message can be expressed with a constant delay:

from("seda:b").delay(1000).to("mock:result");

or with a variable, for example a message header value:

from("seda:a").delay().header("MyDelay").to("mock:result");
mjn
  • Camel seems indeed like a valid alternative to "plain" MQs and/or home-grown ones with redis. Still, my other question(s) (where to place it/pros and cons/centralised vs. distributed) remain. (For example, does Camel support a distributed model? If so, how?) – Lorenzo Dematté Mar 07 '13 at 12:24

1> I suggest using a message queue. Choose the queue depending on your requirements, but for most cases any of them would do. I suggest you pick a queue based on the JMS protocol (ActiveMQ) or AMQP (RabbitMQ) and write a simple wrapper over it, or use the ones provided by Spring: spring-jms or spring-amqp.

2> You can write queue consumers such that they notify your system when a new message arrives; for example, in Rabbit you can implement the MessageListener interface:

public class MyListener implements MessageListener {
    @Override
    public void onMessage(Message message) {
        /* Handle the message */
    }
}

3> If you use async consumers, as in 2>, you can get rid of all polling and cron jobs.
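The push model can be sketched with plain JDK classes. A hedged stand-in, not a real broker: here a BlockingQueue plays the broker's role, and with a real MQ the client library would invoke the listener for you:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal push-style consumer: a dispatcher thread blocks on the queue
// and hands each message to a listener, so the consumer never polls.
public class AsyncConsumer {
    public interface Listener { void onMessage(String message); }

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final ExecutorService dispatcher = Executors.newSingleThreadExecutor();

    public void subscribe(Listener listener) {
        dispatcher.submit(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    listener.onMessage(queue.take()); // blocks until a message arrives
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
    }

    public void publish(String message) { queue.add(message); }

    public void shutdown() { dispatcher.shutdownNow(); }
}
```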

4> It depends on your requirements: if you have millions of events/messages passing through your queue, then running the queue middleware on a centralized server makes sense.

5> If resource consumption is not an issue, then keeping your consumers/subscribers running all the while is the easiest way to go. If these consumers are distributed, you can orchestrate them using a service like ZooKeeper.

6> Scalability: most queuing systems provide for easy distribution of messages, so provided your consumers are stateless, scaling is possible just by adding new consumers and some configuration.

winash
  • Thank you for your answer; I like the listener approach that JMS-based MQs allow. Still, I need more details: you say `keeping your consumers/subscribers running all the while is the easiest way to go`, but here I worry more about availability (what if a subscriber crashes and I do not find out?) than about resource consumption. – Lorenzo Dematté Mar 25 '13 at 10:44
  • Also, I am still looking for the best way to do the "process/deliver this message in 2 hours" kind of job – Lorenzo Dematté Mar 25 '13 at 10:45
  • Availability: you could use JMS or AMQP transactions, and if the transaction fails you would be notified; you could also do this using failed acknowledgments. Delayed messages: JMS does not have this, but it should be easy to add yourself; RabbitMQ recently added it in an indirect way http://www.javacodegeeks.com/2012/04/rabbitmq-scheduled-message-delivery.html – winash Mar 26 '13 at 13:10

After implementing it, I feel that answering my own question may be useful for people who visit StackOverflow in the future.

In the end, I went with Redis. It is really fast and scalable, and I like its flexibility a lot: it is much more flexible than message queues. Am I asserting that Redis is better than the various MQs out there? Well, in my specific case I believe so. The point is: if something is not offered out of the box, you can build it (usually using MULTI, but you can even use Lua scripting for more advanced customization!).

For example, I followed this good answer to implement a "persistent", recoverable pub/sub (i.e. a pub/sub that allows clients to die and reconnect without losing messages).
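The core of that pattern, for reference: instead of a fire-and-forget SUBSCRIBE, each consumer moves work into a per-consumer "processing" list and removes it only after acknowledging, so a crashed consumer's in-flight items can be recovered. In Redis this is done with RPOPLPUSH and LREM; below is the shape of it in plain Java, with in-memory Deques standing in for the Redis lists (a sketch of the pattern, not of the actual Redis calls):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the "reliable queue" pattern: pending -> in-progress -> ack.
// With Redis, the two Deques would be lists, and the moves would be
// RPOPLPUSH (atomic pop+push) and LREM (remove on ack).
public class ReliableQueue {
    private final Deque<String> pending = new ArrayDeque<>();
    private final Deque<String> inProgress = new ArrayDeque<>();

    public void publish(String msg) { pending.addLast(msg); }

    // Atomically move a message to the in-progress list (RPOPLPUSH).
    public String fetch() {
        String msg = pending.pollFirst();
        if (msg != null) inProgress.addLast(msg);
        return msg;
    }

    // Consumer finished: drop the message for good (LREM).
    public void ack(String msg) { inProgress.remove(msg); }

    // Consumer died before ack: requeue everything it had in flight.
    public void recover() {
        while (!inProgress.isEmpty()) pending.addFirst(inProgress.pollLast());
    }

    public int pendingSize() { return pending.size(); }
}
```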

This helped me with both my scalability and my "reliability" requirements: I decided to keep every piece in the pipeline independent (a daemon, for now), but to add a monitor which examines the lists/queues on Redis; if something is not consumed (or is consumed too slowly), the monitor spawns a new consumer. I am also thinking of being truly "elastic", adding the ability for consumers to kill themselves when there is no work to be done.

Another example: execution of scheduled activities. I am following this approach, which seems quite popular, for now. But I am eager to try keyspace notifications, to see if a combination of expiring keys and notifications can be a superior approach.
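The sorted-set scheduling idea, for reference: ZADD each task with its due timestamp as the score, and have a worker pop everything whose score is <= now. Here is a plain-Java sketch of the same shape, with a TreeMap standing in for the Redis sorted set (with Redis, the poll would be ZRANGEBYSCORE plus removal, ideally made atomic with MULTI or a Lua script):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of delayed delivery via a score-ordered structure: the score is
// the time at which the task becomes due ("deliver it in 1 day" becomes
// score = now + 86_400_000 ms).
public class DelayedScheduler {
    private final TreeMap<Long, List<String>> byDueTime = new TreeMap<>();

    public void schedule(String task, long nowMillis, long delayMillis) {
        byDueTime.computeIfAbsent(nowMillis + delayMillis, k -> new ArrayList<>())
                 .add(task);
    }

    // Pop every task whose due time has passed (ZRANGEBYSCORE -inf..now).
    public List<String> popDue(long nowMillis) {
        List<String> due = new ArrayList<>();
        Map<Long, List<String>> head = byDueTime.headMap(nowMillis, true);
        for (List<String> tasks : head.values()) due.addAll(tasks);
        head.clear(); // removes the popped entries from the backing map
        return due;
    }
}
```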

Finally, as a library to access Redis, my choice went to Jedis: it is popular, well supported, and provides a nice interface for implementing pub/sub as listeners. It is not the most idiomatic approach for Scala, but it works well.

Lorenzo Dematté