I am currently working on a project where I have set up a Storm cluster across four Unix hosts.
The topology itself is as follows:
- JMS Spout listens to an MQ for new messages
- JMS Spout parses and then emits the result to an Esper Bolt
- The Esper Bolt then processes the event and emits a result to a JMS Bolt
- The JMS Bolt then publishes the message back onto the MQ on a different topic
I realize that Storm is an "at-least-once" framework. However, if I receive 5 events and pass them on to the Esper Bolt for counting, I receive 5 count results in the JMS Bolt, all with the same value.
Ideally, I want to receive a single result. Is there some way I can tell Storm to ignore duplicate tuples?
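To make "ignore duplicate tuples" concrete, here is a minimal sketch of the kind of filtering I have in mind. It is plain Java with no Storm API in it: `DuplicateFilter`, `firstTime`, and the string key are all hypothetical names, and it assumes each result could be identified by some key, which may not hold for my tuples. It also only remembers keys inside a single JVM, so it would not catch duplicates handled by different tasks.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Storm: remembers the most recently
// used keys and reports whether a key has been offered before.
class DuplicateFilter {
    private final Map<String, Boolean> seen;

    DuplicateFilter(final int capacity) {
        // Access-ordered LinkedHashMap that evicts the least recently
        // used key once more than `capacity` keys are held.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    // Returns true the first time a key is offered, false for repeats
    // that are still remembered.
    boolean firstTime(String key) {
        return seen.put(key, Boolean.TRUE) == null;
    }

    public static void main(String[] args) {
        DuplicateFilter filter = new DuplicateFilter(1000);
        System.out.println(filter.firstTime("count-5")); // true
        System.out.println(filter.firstTime("count-5")); // false
    }
}
```

In a bolt, the idea would be to emit only when `firstTime` returns true. I am not sure this is the right approach, which is partly why I am asking.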
I think this has something to do with the parallelism that I have set up, because it works as expected when I run everything on a single thread. This is the current topology configuration:
    TopologyBuilder builder = new TopologyBuilder();

    builder.setSpout(JMS_DATA_SPOUT, new JMSDataSpout(), 2)
           .setNumTasks(2);

    builder.setBolt("esperBolt", new EsperBolt.Builder().build(), 6)
           .setNumTasks(6)
           .fieldsGrouping(JMS_DATA_SPOUT, new Fields("eventGrouping"));

    builder.setBolt("jmsBolt", new JMSBolt(), 2)
           .setNumTasks(2)
           .fieldsGrouping("esperBolt", new Fields("eventName"));
I have also seen that Trident offers "exactly-once" semantics. However, I am not fully convinced it would solve this issue.