I'm in need of a strategy to distribute data processing using node.js. I'm trying to figure out whether using a worker pool and isolating groups of tasks in these workers is the best way, or whether a pipe/node-based system like http://strawjs.com/ is the way to go.
The steps I have are the following (for a single job; a rough sketch in code follows the list):
- Extract a zip file containing GIS Shapefiles
- Convert the files to GeoJSON using ogr2ogr
- Denormalize the data in the GeoJSON file
- Transform the data to a format I use in MongoDB
- Upsert the data into a MongoDB collection
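To make the flow concrete, here is a minimal sketch of one job as a chain of steps. The helpers `extractZip`, `denormalize`, `transformForMongo` and `upsertDocs` are hypothetical placeholders for my own code, not a real library, and it assumes `ogr2ogr` is on the PATH:

```js
var exec = require('child_process').exec;
var async = require('async');

function runJob(zipPath, done) {
  async.waterfall([
    function (next) {
      extractZip(zipPath, next);                 // -> path to the extracted .shp
    },
    function (shpPath, next) {
      // ogr2ogr shells out to GDAL
      var geojsonPath = shpPath.replace(/\.shp$/, '.json');
      exec('ogr2ogr -f GeoJSON ' + geojsonPath + ' ' + shpPath, function (err) {
        next(err, geojsonPath);
      });
    },
    function (geojsonPath, next) {
      denormalize(geojsonPath, next);            // -> plain JS objects
    },
    function (docs, next) {
      transformForMongo(docs, next);             // -> documents in my MongoDB schema
    },
    function (docs, next) {
      upsertDocs(docs, next);                    // -> bulk upsert into the collection
    }
  ], done);
}
```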
The main problem is that I don't really know how to merge data from the different GeoJSON files when using a pipe/node-based system like straw.
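Regardless of whether straw or something else drives it, this is roughly what I imagine the merge step would have to do: accumulate features per job until every file of that job has arrived, then emit one combined FeatureCollection downstream. The `jobId`, `fileCount` and `emit` names are assumptions about the message shape, not straw's API:

```js
var pending = {}; // jobId -> { remaining, features }

function onGeoJsonMessage(msg, emit) {
  var job = pending[msg.jobId] || (pending[msg.jobId] = {
    remaining: msg.fileCount,
    features: []
  });

  job.features = job.features.concat(msg.geojson.features);
  job.remaining -= 1;

  if (job.remaining === 0) {
    delete pending[msg.jobId];
    emit({
      jobId: msg.jobId,
      geojson: { type: 'FeatureCollection', features: job.features }
    });
  }
}
```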
I understand how to do the work in worker pools, but I don't know how to distribute the workers across several machines.
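The pattern I've been considering for spreading the pool over machines is a shared queue: every machine runs a few worker processes, and whichever one pops a job runs the whole pipeline for it. A sketch using Redis lists with the classic callback-style `redis` client (the host name and queue key are made up):

```js
var redis = require('redis');
var client = redis.createClient(6379, 'queue-host'); // assumed hostname

function workLoop() {
  // BRPOP blocks until a job is available on the "jobs" list
  client.brpop('jobs', 0, function (err, reply) {
    if (err) return setTimeout(workLoop, 1000);
    var job = JSON.parse(reply[1]); // reply = [listName, payload]
    runJob(job.zipPath, function () {
      workLoop(); // take the next job
    });
  });
}

workLoop();

// Producer side (anywhere in the cluster):
//   client.lpush('jobs', JSON.stringify({ zipPath: '/data/in/foo.zip' }));
```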
I've tried the naive way of doing it in a single thread on a single machine using the async module. This works well for small sets of data, but in production I need to be able to support millions of documents at a pretty frequent interval.
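For the single-machine case, the async module already gives me a bounded worker pool via `async.queue`, so jobs don't all run (or sit in memory) at once. The concurrency value and path below are arbitrary examples:

```js
var async = require('async');

var jobQueue = async.queue(function (task, callback) {
  runJob(task.zipPath, callback); // same pipeline as above
}, 4); // run at most 4 jobs concurrently

jobQueue.drain = function () {
  console.log('all jobs processed');
};

jobQueue.push({ zipPath: '/data/in/foo.zip' });
```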
The reason for using node.js is that we already have a solid infrastructure for scaling node.js processes, and we use node.js in almost every aspect of our production environment.