Storm max spout pending

Question

This is a question regarding how Storm's max spout pending works. I currently have a spout that reads a file and emits a tuple for each line in the file (I know Storm is not the best solution for dealing with files but I do not have a choice for this problem).

I set topology.max.spout.pending to 50k to throttle how many tuples enter the topology for processing. However, the setting appears to have no effect: every record in a file is emitted every time. My guess is that this is caused by a loop in my nextTuple() method that emits all records in a file.

My question is: Does Storm just stop calling nextTuple() for the Spout task when topology.max.spout.pending is reached? Does this mean I should only emit one tuple every time the method is called?

John Gilmore · Accepted Answer · 2014-10-26T05:26:14.977

18

Exactly! Storm can only throttle your spout by holding back calls to nextTuple(), so if you emit everything on the first nextTuple() call, there is no way for Storm to throttle your spout.

The Storm developers recommend emitting a single tuple per nextTuple() call. The Storm framework will then throttle your spout as needed to meet the "max spout pending" requirement. If you're emitting a high number of tuples, you can batch your emits to at most a tenth of your max spout pending per call, to give Storm the chance to throttle.
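As a rough illustration of the one-tuple-per-call advice, here is a minimal sketch of a line-reading spout. It deliberately does not import Storm: `TupleSink` is a hypothetical stand-in for Storm's output collector and `LineSpoutSketch` is an invented class name, so treat this as the shape of the idea rather than a drop-in spout.

```java
import java.util.Iterator;

/** Hypothetical stand-in for Storm's output collector, so the sketch compiles without Storm. */
interface TupleSink {
    void emit(String line, Object messageId);
}

/** Emits at most one line per nextTuple() call, so the caller can stop asking at any point. */
class LineSpoutSketch {
    private final Iterator<String> lines;
    private final TupleSink sink;
    private long nextId = 0;

    LineSpoutSketch(Iterable<String> source, TupleSink sink) {
        this.lines = source.iterator();
        this.sink = sink;
    }

    /** Same shape as a spout's nextTuple(): no loop over the whole file. */
    void nextTuple() {
        if (lines.hasNext()) {
            // Emitting with a message id is what makes the tuple count as "pending".
            sink.emit(lines.next(), nextId++);
        }
        // If there is nothing to emit, simply return and wait for the next call.
    }
}
```

With this shape, a file of a million lines takes a million nextTuple() calls, and the framework can pause between any two of them.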

edited Oct 26 '14 at 05:26

answered Jul 21 '14 at 20:56


Can you help with this question: stackoverflow.com/questions/34327617/… ? I have topology.max.spout.pending set to 50000000. Is that a problem? – Jan 14 '16 at 22:11

score 17 · Answer 2 · edited Oct 23 '18 at 10:00

Storm topologies have a max spout pending parameter. The max spout pending value for a topology can be configured via the “topology.max.spout.pending” setting in the topology configuration yaml file. This value puts a limit on how many tuples can be in flight, i.e. have not yet been acked or failed, in a Storm topology at any point in time.

The need for this parameter comes from the fact that Storm uses ZeroMQ to dispatch tuples from one task to another task. If the consumer side of ZeroMQ is unable to keep up with the tuple rate, then the ZeroMQ queue starts to build up. Eventually tuples time out at the spout and get replayed to the topology, thus adding more pressure on the queues. To avoid this pathological failure case, Storm allows the user to put a limit on the number of tuples that are in flight in the topology. This limit takes effect on a per-spout-task basis and not at the topology level. (source) For cases when the spouts are unreliable, i.e. they don’t emit a message id with their tuples, this value has no effect.

One of the problems that Storm users continually face is coming up with the right value for this max spout pending parameter. A very small value can easily starve the topology, and a sufficiently large value can overload the topology with a huge number of tuples to the extent of causing failures and replays. Users have to go through several iterations of topology deployments with different max spout pending values to find the value that works best for them.
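To make the "in flight" limit concrete, here is a toy model of the throttle for a single spout task. It is not Storm's executor code: `PendingThrottleSketch` and its methods are invented for illustration, and the model assumes the spout emits exactly one tuple with a message id each time it is called.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of one spout task's pending limit; names and structure are hypothetical. */
class PendingThrottleSketch {
    private final int maxSpoutPending;
    // Ids of tuples that were emitted but have not yet been acked or failed.
    private final Deque<Long> pending = new ArrayDeque<>();
    private long nextId = 0;
    private int nextTupleCalls = 0;

    PendingThrottleSketch(int maxSpoutPending) {
        this.maxSpoutPending = maxSpoutPending;
    }

    /** One pass of the loop: the spout is only asked for a tuple while below the cap. */
    boolean tick() {
        if (pending.size() >= maxSpoutPending) {
            return false; // throttled: nextTuple() is not called on this pass
        }
        nextTupleCalls++;
        pending.add(nextId++); // assumed: the spout emits exactly one tuple per call
        return true;
    }

    /** An ack (or a fail) for the oldest in-flight tuple frees one slot. */
    void ackOldest() {
        pending.poll();
    }

    int inFlight() {
        return pending.size();
    }

    int calls() {
        return nextTupleCalls;
    }
}
```

Because the limit applies per spout task, the topology-wide ceiling in this model is the configured value multiplied by the number of spout tasks. It also shows why a spout that emits a whole file in one call defeats the limit: the check happens only between calls.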

Can you help with this question: http://stackoverflow.com/questions/34327617/supervsior-still-hasnt-start ? I have topology.max.spout.pending set to 50000000. Is that a problem? – Dec 17 '15 at 06:42

score 1 · Answer 3 · answered Sep 02 '15 at 03:47

One solution is to fill an input queue outside the nextTuple() method, so that the only thing nextTuple() does is poll the queue and emit. If you are processing multiple files, your nextTuple() method should also check whether the result of polling the queue is null and, if so, atomically switch the source file that is populating your queue.
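A minimal sketch of that queue-based layout might look like the following. The class name, the queue capacity of 1000, and the `Consumer<String>` used in place of Storm's collector are all assumptions made for illustration; the file-reading thread that would call `enqueue` is left out.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Spout-like class whose nextTuple() only polls a queue filled elsewhere (hypothetical names). */
class QueueBackedSpoutSketch {
    // Bounded so a fast file reader cannot pile up lines without limit; 1000 is arbitrary.
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(1000);
    private final Consumer<String> emitter; // stand-in for the real collector's emit

    QueueBackedSpoutSketch(Consumer<String> emitter) {
        this.emitter = emitter;
    }

    /** Called by a reader outside nextTuple(); returns false if the queue is currently full. */
    boolean enqueue(String line) {
        return queue.offer(line);
    }

    /** Polls once and emits at most one line; returns whether a line was emitted. */
    boolean nextTuple() {
        String line = queue.poll(); // non-blocking; null when the queue is empty
        if (line == null) {
            // Empty queue: this is the point where a multi-file reader would
            // switch to the next source file.
            return false;
        }
        emitter.accept(line);
        return true;
    }
}
```

Keeping the poll non-blocking matters here, since nextTuple() is expected to return quickly when there is nothing to emit.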
