
Say I have a list of 10,000 lines of string that needs to be processed by 100 worker scripts.

I would like as many of the 100 scripts as possible to run concurrently.

Once a worker script is finished with one line, it should process the next available line that is not currently being processed by another worker script.

If a worker script fails on a line, it will skip it and move on to the next available line that is not currently being processed by another worker script.

At any time, a worker script may become unavailable for an unknown amount of time.

Now assume that any of the initial 100 worker scripts may become unavailable (either by crashing or by taking too long with its current line) but will become available again after some time. It may then become unavailable again, and it may even stay unavailable for the remainder of the run over the 10,000 lines.

How can I process all 10,000 lines with the initial 100 worker scripts running concurrently, given that any of them may become unavailable and then, after some unknown random time, become available again and ready to process?

I imagine something like looping over all 10,000 lines, with another script polling all available workers at intervals and dispatching work to them.
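The loop-plus-polling idea above might be sketched like this. Everything here is hypothetical: `try_line(worker, line)` stands in for whatever mechanism actually hands a line to a worker script, returning `True` on success, `False` when the worker fails on the line (which we then skip, per the rules above), and raising `TimeoutError` when the worker is currently unavailable.

```python
import time

def dispatch(lines, worker_ids, try_line):
    """Hand each line to the first worker that responds.

    try_line(worker, line) is a hypothetical callback:
      - returns True  -> line processed successfully
      - returns False -> worker failed on the line; skip it and move on
      - raises TimeoutError -> worker unavailable right now; try another
    """
    pending = list(lines)
    done = []
    while pending:
        line = pending.pop(0)
        handled = False
        for w in worker_ids:
            try:
                ok = try_line(w, line)
            except TimeoutError:
                continue  # this worker is unavailable; poll the next one
            handled = True
            if ok:
                done.append(line)
            # on failure (ok is False) the line is skipped, per the question
            break
        if not handled:
            pending.append(line)  # no worker was available; retry later
            time.sleep(0)         # placeholder for a real polling interval
    return done
```

Note that this central dispatcher is the bottleneck variant: the answer below suggests inverting control so that idle workers pull work themselves.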

I am uncertain how I would approach this problem.

KJW

1 Answer


The producer/consumer pattern is pretty helpful for situations like this. I explained it a bit more over here.
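A minimal producer/consumer sketch in Python, assuming the "processing" is something you can call in-process (here a stand-in `line.upper()`): all lines go into one shared queue, and each worker pulls the next available line as soon as it finishes its current one. A worker that fails on a line records it as skipped and moves on; a worker that hangs or dies simply stops pulling, and the remaining workers drain the queue.

```python
import queue
import threading

def run(lines, num_workers=4):
    """Process lines with num_workers consumer threads pulling from one queue."""
    work = queue.Queue()
    for line in lines:
        work.put(line)

    processed, skipped = [], []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                line = work.get_nowait()
            except queue.Empty:
                return  # queue drained; this worker is done
            try:
                result = line.upper()  # stand-in for the real per-line work
            except Exception:
                with lock:
                    skipped.append(line)  # failed line: skip and move on
            else:
                with lock:
                    processed.append(result)
            finally:
                work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed, skipped
```

The key property is that work assignment is pull-based: nothing has to poll workers or track which line belongs to whom, so a worker disappearing for a while costs nothing but its own throughput. If the workers are separate OS processes rather than threads, the same shape works with a message queue or a job table in place of `queue.Queue`.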

That said, if your situation is really that straightforward, simpler techniques may be more appropriate, like partitioning the data evenly.
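For the partitioning alternative, a sketch of splitting the 10,000 lines into 100 nearly-even contiguous chunks, one per worker (`partition` is an illustrative helper, not a library function):

```python
def partition(items, n):
    """Split items into n contiguous chunks whose sizes differ by at most 1."""
    size, remainder = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks
```

The tradeoff is that static partitioning has no rebalancing: if one worker goes away, its chunk stalls until it comes back, which is why the queue approach fits the flaky-worker scenario better.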

Also, I assume you're not expecting a 100x speedup, since your hardware surely wouldn't support that.

Of course, if I've completely misunderstood and you actually want to process each string 100 times (i.e., each script does something different), then please clarify.

Michael Haren