I'm in need of a strategy to distribute data processing using node.js. I'm trying to figure out whether using a worker pool and isolating groups of tasks in these workers is the best way, or whether a pipe/node-based system like http://strawjs.com/ is the way to go.
The steps I have are the following (for a single job; a rough sketch in code follows the list):
- Extract a zip file containing GIS Shapefiles
- Convert the files to GeoJSON using ogr2ogr
- Denormalize the data in the GeoJSON file
- Transform the data to a format I use in MongoDB
- Upsert the data into a MongoDB collection
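To make the flow concrete, here is a minimal sketch of one job as a chain of steps. The helpers `extractZip`, `denormalize`, `transformForMongo` and `upsertDocs` are hypothetical placeholders for my own code, not a real library, and it assumes `ogr2ogr` is on the PATH:

```js
var exec = require('child_process').exec;
var async = require('async');

function runJob(zipPath, done) {
  async.waterfall([
    function (next) {
      extractZip(zipPath, next);                 // -> path to the extracted .shp
    },
    function (shpPath, next) {
      // ogr2ogr shells out to GDAL
      var geojsonPath = shpPath.replace(/\.shp$/, '.json');
      exec('ogr2ogr -f GeoJSON ' + geojsonPath + ' ' + shpPath, function (err) {
        next(err, geojsonPath);
      });
    },
    function (geojsonPath, next) {
      denormalize(geojsonPath, next);            // -> plain JS objects
    },
    function (docs, next) {
      transformForMongo(docs, next);             // -> documents in my MongoDB schema
    },
    function (docs, next) {
      upsertDocs(docs, next);                    // -> bulk upsert into the collection
    }
  ], done);
}
```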
The main problem is that I don't really know how to merge data from the different GeoJSON files when using a pipe/node-based system like straw.
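Regardless of whether straw or something else drives it, this is roughly what I imagine the merge step would have to do: accumulate features per job until every file of that job has arrived, then emit one combined FeatureCollection downstream. The `jobId`, `fileCount` and `emit` names are assumptions about the message shape, not straw's API:

```js
var pending = {}; // jobId -> { remaining, features }

function onGeoJsonMessage(msg, emit) {
  var job = pending[msg.jobId] || (pending[msg.jobId] = {
    remaining: msg.fileCount,
    features: []
  });

  job.features = job.features.concat(msg.geojson.features);
  job.remaining -= 1;

  if (job.remaining === 0) {
    delete pending[msg.jobId];
    emit({
      jobId: msg.jobId,
      geojson: { type: 'FeatureCollection', features: job.features }
    });
  }
}
```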
I understand how to do the work in worker pools, but I don't know how to distribute the workers across several machines.
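The pattern I've been considering for spreading the pool over machines is a shared queue: every machine runs a few worker processes, and whichever one pops a job runs the whole pipeline for it. A sketch using Redis lists with the classic callback-style `redis` client (the host name and queue key are made up):

```js
var redis = require('redis');
var client = redis.createClient(6379, 'queue-host'); // assumed hostname

function workLoop() {
  // BRPOP blocks until a job is available on the "jobs" list
  client.brpop('jobs', 0, function (err, reply) {
    if (err) return setTimeout(workLoop, 1000);
    var job = JSON.parse(reply[1]); // reply = [listName, payload]
    runJob(job.zipPath, function () {
      workLoop(); // take the next job
    });
  });
}

workLoop();

// Producer side (anywhere in the cluster):
//   client.lpush('jobs', JSON.stringify({ zipPath: '/data/in/foo.zip' }));
```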
I've tried the naive way of doing it in a single thread on a single machine using the async module. This works well for small sets of data, but in production I need to be able to support millions of documents at a pretty frequent interval.
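For the single-machine case, the async module already gives me a bounded worker pool via `async.queue`, so jobs don't all run (or sit in memory) at once. The concurrency value and path below are arbitrary examples:

```js
var async = require('async');

var jobQueue = async.queue(function (task, callback) {
  runJob(task.zipPath, callback); // same pipeline as above
}, 4); // run at most 4 jobs concurrently

jobQueue.drain = function () {
  console.log('all jobs processed');
};

jobQueue.push({ zipPath: '/data/in/foo.zip' });
```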
The reason for using node.js is that we already have a solid infrastructure for scaling node.js processes, and we use node.js in almost every aspect of our production environment.